From patchwork Thu Feb 26 11:05:25 2026 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Patchwork-Submitter: "Richard.Wild@arm.com" X-Patchwork-Id: 130681 Return-Path: X-Original-To: patchwork@sourceware.org Delivered-To: patchwork@sourceware.org Received: from vm01.sourceware.org (localhost [127.0.0.1]) by sourceware.org (Postfix) with ESMTP id CD5B44BA2E0B for ; Thu, 26 Feb 2026 11:08:04 +0000 (GMT) DKIM-Filter: OpenDKIM Filter v2.11.0 sourceware.org CD5B44BA2E0B Authentication-Results: sourceware.org; dkim=pass (1024-bit key, unprotected) header.d=arm.com header.i=@arm.com header.a=rsa-sha256 header.s=selector1 header.b=OlSVTLG2; dkim=pass (1024-bit key) header.d=arm.com header.i=@arm.com header.a=rsa-sha256 header.s=selector1 header.b=OlSVTLG2 X-Original-To: libc-alpha@sourceware.org Delivered-To: libc-alpha@sourceware.org Received: from AM0PR83CU005.outbound.protection.outlook.com (mail-westeuropeazlp170100001.outbound.protection.outlook.com [IPv6:2a01:111:f403:c201::1]) by sourceware.org (Postfix) with ESMTPS id 9777D4BA23CA for ; Thu, 26 Feb 2026 11:06:48 +0000 (GMT) DMARC-Filter: OpenDMARC Filter v1.4.2 sourceware.org 9777D4BA23CA Authentication-Results: sourceware.org; dmarc=pass (p=none dis=none) header.from=arm.com Authentication-Results: sourceware.org; spf=pass smtp.mailfrom=arm.com ARC-Filter: OpenARC Filter v1.0.0 sourceware.org 9777D4BA23CA Authentication-Results: server2.sourceware.org; arc=pass smtp.remote-ip=2a01:111:f403:c201::1 ARC-Seal: i=3; a=rsa-sha256; d=sourceware.org; s=key; t=1772104009; cv=pass; b=xh/WDEtEBjX8Lf7FDJ9P2qDKNzmws23+ZsDvxlZ5hLxt0slOpYazuwYMqRF0KfU+XLsRGldgVjip3z5ByrukSB09K6KP9wlIeeeUYj4lww30t7lf6JA4CB0TdK10JTkokdFLw69K3NIdVPKNg6ag1ViIo6fQ9fLGezui5hNYeYE= ARC-Message-Signature: i=3; a=rsa-sha256; d=sourceware.org; s=key; t=1772104009; c=relaxed/simple; bh=K5y2kT6QyK6ajTOWIPf8bQoAuNl2AdHnW7unJq/UgsI=; h=DKIM-Signature:DKIM-Signature:From:To:Subject:Date:Message-ID: MIME-Version; b=vrvKczCb8CvpZoZ1nGuq9dikUIVZAVk63xhaEPnFPOqCiI710X/N7Rta2zWn0el4mYwOi/E0G6x0SGVolivcYSIGFFHLwjAL5x12dSZox2Cpk6RBqZiPqzEIQUDdLPtaShl5LeK/juj3xVBL09q3s2ub3NBq3h1TKcrPfhadrWg= ARC-Authentication-Results: i=3; server2.sourceware.org DKIM-Filter: OpenDKIM Filter v2.11.0 sourceware.org 9777D4BA23CA ARC-Seal: i=2; a=rsa-sha256; s=arcselector10001; d=microsoft.com; cv=pass; b=mtN8tbWTlA+OHDRTiAj7FSvCk45+lW8LGuX2llhpH7NTiPIayV+1gj+ndv2WnRWlwJoyqnhTJocysmdnr7oOURafhVdlGbi+T+Z2wwidxo36JcnUBcwqwSXUUqmt1+wlqilTLm5W66nlFOsjUrhVkb4Gif+yH0FMSEc62l/zrEw493TbtwRAOi0qvdiXY9Sl1Lhz43km3PKRlaoJpFUuLQ8uvxMrsZM/BL6aBQA5sa6waw717gS7SrkzDXceDg4BdbipK0d2VFpQ48Dca2LTKQjBxXeRQvaGrSJGThtMtPYp7cfxkHTNUm9bmqYlEOOlWBaXe1xYusuja9RkgDYBDA== ARC-Message-Signature: i=2; a=rsa-sha256; c=relaxed/relaxed; d=microsoft.com; s=arcselector10001; h=From:Date:Subject:Message-ID:Content-Type:MIME-Version:X-MS-Exchange-AntiSpam-MessageData-ChunkCount:X-MS-Exchange-AntiSpam-MessageData-0:X-MS-Exchange-AntiSpam-MessageData-1; bh=ySW8VJtz/Wnclxsxy7tb0a6+KD39mveH3Z8diQbsCLs=; b=DmaefJRbehTB4FaOKNgVEUohsHf9C3IjF0EXCxEQ47rQhiqqOcPVOEtJcFmr2onL23x2N9Q6JYLKHnYP1nw55Az/movcqEJsPfgc3CsSOfOCezAqtJzL86VV7oARAveSnNOGoGuBp4hoRbJsAgvudcgFGDm/1zFhksGivCvQwbz3KLMtsibCRzTxwYABvx1F8lYiM5ddLAycwGiZJHIyKHPQudERAYheqGmr9LXj4BEEWXuQANmSoszEG0B4jQPPF5GK/vR0FTjwYOb8PX4derEYFz+75ISBcIAXSPHXfzeMLMx0e/XZ3neuXOoXB3CRGcFD1e6UQ//StaK2s9gCSg== ARC-Authentication-Results: i=2; mx.microsoft.com 1; spf=pass (sender ip is 4.158.2.129) smtp.rcpttodomain=sourceware.org smtp.mailfrom=arm.com; dmarc=pass (p=none sp=none pct=100) action=none header.from=arm.com; dkim=pass (signature was verified) header.d=arm.com; arc=pass (0 oda=1 ltdi=1 spf=[1,1,smtp.mailfrom=arm.com] dmarc=[1,1,header.from=arm.com]) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=arm.com; s=selector1; h=From:Date:Subject:Message-ID:Content-Type:MIME-Version:X-MS-Exchange-SenderADCheck; bh=ySW8VJtz/Wnclxsxy7tb0a6+KD39mveH3Z8diQbsCLs=; b=OlSVTLG2x2UsQUBJ3jGvVclO1KYbFoe/CiaZWaTx0M170mXssKrPo4sg3BiXjHlmHGgQgJysWukiePVxbi79vgeTpIOlw28PSLIVIN+nANWi1WUiWqnxmd8kd/VRbyuOTH1tSJg9MpPJs+Fv4dmIutoQ2/6xIkIsj4W65ph3bxI= Received: from AS4P189CA0024.EURP189.PROD.OUTLOOK.COM (2603:10a6:20b:5db::14) by AS2PR08MB9414.eurprd08.prod.outlook.com (2603:10a6:20b:596::18) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.20.9632.22; Thu, 26 Feb 2026 11:06:39 +0000 Received: from AMS0EPF000001A8.eurprd05.prod.outlook.com (2603:10a6:20b:5db:cafe::6f) by AS4P189CA0024.outlook.office365.com (2603:10a6:20b:5db::14) with Microsoft SMTP Server (version=TLS1_3, cipher=TLS_AES_256_GCM_SHA384) id 15.20.9632.25 via Frontend Transport; Thu, 26 Feb 2026 11:06:34 +0000 X-MS-Exchange-Authentication-Results: spf=pass (sender IP is 4.158.2.129) smtp.mailfrom=arm.com; dkim=pass (signature was verified) header.d=arm.com;dmarc=pass action=none header.from=arm.com; Received-SPF: Pass (protection.outlook.com: domain of arm.com designates 4.158.2.129 as permitted sender) receiver=protection.outlook.com; client-ip=4.158.2.129; helo=outbound-uk1.az.dlp.m.darktrace.com; pr=C Received: from outbound-uk1.az.dlp.m.darktrace.com (4.158.2.129) by AMS0EPF000001A8.mail.protection.outlook.com (10.167.16.148) with Microsoft SMTP Server (version=TLS1_3, cipher=TLS_AES_256_GCM_SHA384) id 15.20.9632.12 via Frontend Transport; Thu, 26 Feb 2026 11:06:39 +0000 ARC-Seal: i=1; a=rsa-sha256; s=arcselector10001; d=microsoft.com; cv=none; b=Le9XCtDntKKDxudQ9bCSStEmU8w9flhhJ9DviaUixnYbECieIUl5bLR0Qp2tIFfAsFhR4Le8wNMRuVbr5v6iSWalsbX4yy0pKpkukKRgpMtjOoeoN4QA/9PT19tyN4AA8FyS5rDLPHapq9A2q+UwG5jP4pha+PRqK2vqytEzOyllkqjWFTqhskP821L/UPVf6WY9owXEG135sSrtEjnM7TpJGRJvwSYEvA+KGCr12pgcByNslEGg3jixpSoqMTh3goFOYB2qm4U0ObjZAa5MoOd85wlnu3t93418cv0NKq4bn7FyQlkM2V7iUBK30kSO31wMIZjEedKG12dQ4yu9Dg== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=microsoft.com; s=arcselector10001; h=From:Date:Subject:Message-ID:Content-Type:MIME-Version:X-MS-Exchange-AntiSpam-MessageData-ChunkCount:X-MS-Exchange-AntiSpam-MessageData-0:X-MS-Exchange-AntiSpam-MessageData-1; bh=ySW8VJtz/Wnclxsxy7tb0a6+KD39mveH3Z8diQbsCLs=; b=HWv+b+FamLeWeuta64AERkx85+hCdcvZfKbPlkkH+ZCbv0Zx05Pk7584GimqcWQv2dUDjWRkrv05N6rp/RTL8AangoN84n5vKI7/zNEUD2eJ8/804nMuUka39jza3j5iEVXgZskkkM3hXPCgp9BpgrCefdZk2I2/IEOg68zbUbHoZ8qFQqZqnFT44GWO9/dsxMaoYy9Wl8K0tJ5CEbxIULCbB4Ek5gAsehG8/RhaMcrsjv9AY0Bl5//vpv6n6wysa9qvanjwSuXgcIks3eqnPZjT3cJvknY3+4LXxb18b5E4WOkkG9ECQb9NN37dkIU/vjqnQRX2NK5xD8IU1OpcRA== ARC-Authentication-Results: i=1; mx.microsoft.com 1; spf=pass (sender ip is 172.205.89.229) smtp.rcpttodomain=sourceware.org smtp.mailfrom=arm.com; dmarc=pass (p=none sp=none pct=100) action=none header.from=arm.com; dkim=none (message not signed); arc=none (0) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=arm.com; s=selector1; h=From:Date:Subject:Message-ID:Content-Type:MIME-Version:X-MS-Exchange-SenderADCheck; bh=ySW8VJtz/Wnclxsxy7tb0a6+KD39mveH3Z8diQbsCLs=; b=OlSVTLG2x2UsQUBJ3jGvVclO1KYbFoe/CiaZWaTx0M170mXssKrPo4sg3BiXjHlmHGgQgJysWukiePVxbi79vgeTpIOlw28PSLIVIN+nANWi1WUiWqnxmd8kd/VRbyuOTH1tSJg9MpPJs+Fv4dmIutoQ2/6xIkIsj4W65ph3bxI= Received: from CWLP123CA0115.GBRP123.PROD.OUTLOOK.COM (2603:10a6:401:5f::31) by PAVPR08MB9601.eurprd08.prod.outlook.com (2603:10a6:102:311::17) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.20.9632.23; Thu, 26 Feb 2026 11:05:28 +0000 Received: from AMS0EPF00000192.eurprd05.prod.outlook.com (2603:10a6:401:5f:cafe::37) by CWLP123CA0115.outlook.office365.com (2603:10a6:401:5f::31) with Microsoft SMTP Server (version=TLS1_3, cipher=TLS_AES_256_GCM_SHA384) id 15.20.9632.26 via Frontend Transport; Thu, 26 Feb 2026 11:05:31 +0000 X-MS-Exchange-Authentication-Results: spf=pass (sender IP is 172.205.89.229) smtp.mailfrom=arm.com; dkim=none (message not signed) header.d=none;dmarc=pass action=none header.from=arm.com; Received-SPF: Pass (protection.outlook.com: domain of arm.com designates 172.205.89.229 as permitted sender) receiver=protection.outlook.com; client-ip=172.205.89.229; helo=nebula.arm.com; pr=C Received: from nebula.arm.com (172.205.89.229) by AMS0EPF00000192.mail.protection.outlook.com (10.167.16.218) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.20.9632.12 via Frontend Transport; Thu, 26 Feb 2026 11:05:28 +0000 Received: from AZ-NEU-EX03.Arm.com (10.240.25.137) by AZ-NEU-EX03.Arm.com (10.240.25.137) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.2.2562.29; Thu, 26 Feb 2026 11:05:27 +0000 Received: from ip-10-249-51-10.eu-west-1.compute.internal (10.252.0.220) by mail.arm.com (10.240.25.137) with Microsoft SMTP Server id 15.2.2562.29 via Frontend Transport; Thu, 26 Feb 2026 11:05:27 +0000 From: "Richard.Wild@arm.com" To: CC: "Richard.Wild@arm.com" Subject: [PATCH] AArch64: Single and Double precision entire exp family, SVE and AdvSIMD optimisations Date: Thu, 26 Feb 2026 11:05:25 +0000 Message-ID: <20260226110525.12040-1-richard.wild@arm.com> X-Mailer: git-send-email 2.34.1 MIME-Version: 1.0 X-EOPAttributedMessage: 1 X-MS-TrafficTypeDiagnostic: AMS0EPF00000192:EE_|PAVPR08MB9601:EE_|AMS0EPF000001A8:EE_|AS2PR08MB9414:EE_ X-MS-Office365-Filtering-Correlation-Id: 5ee98ee4-3d04-4ed9-5f6e-08de75271d89 x-checkrecipientrouted: true NoDisclaimer: true X-MS-Exchange-SenderADCheck: 1 X-MS-Exchange-AntiSpam-Relay: 0 X-Microsoft-Antispam-Untrusted: BCL:0; ARA:13230040|376014|82310400026|1800799024|36860700013|13003099007; X-Microsoft-Antispam-Message-Info-Original: Se2z1ZofmCqq7KQJxALJkVZm0YK4koy+x2/DNxN9RshK5TqZ6jFCFKKH1SY9EIYtzcDVIC53g6SCp2xl7Uw22XnkQ4XJcXLE/+WFdzyT/T+0NIsjXxprDfO3hP4RhhLtS2WQJJavm8h6pKgTHUGNv5aGc//bqh6oqznYr9FpKfXg9eJ42bgFeRiElA6aUOI/+0PluD9p7MWJjXsmpFSc3v49kqpFaNQMdyaFqhpzEzBOVXmP15OTVC0UrbOVf4oE8O+MO+FQcucrkvzW5/RHUPDNsE/fFC7ucZvSHxTxyIt+E3tfgZKaDyfbqjvnRRiRSZhDXpkCi+a+r3SLY2chZZNLP2hLlRvtEhlQwI8quff+Mc+2mwdxzZCYIA4ziWWKT74kavcgtcKfNUhwH9TreMa5eg+/+NT0E7ewAWVIMD7ycDGduCdJYwNo97cB/pudTFi3sFHdV734NYqCGqr4WZwtisdveSvJqMRTxCfA09b4S2KW7KyxPU/LUOwfmpKoUeMT8ov/PeXUWkfidusOggABDLFFUmzc9J8kh2hqUIG0qMYmoqJIACUM4vO0NJnSGbH5vTSkJS9I6HHTr2YSRbM96DQmaQTdktmsRt/xzn4QsvzTy1aM0VthscNet/q4meWRWTfDAk0mk8DsCpo9I6CNYU0Ihvh56C1iG59J7ZgbIp8y9y9XIa9+YVNndHrHMNTh98Ooy0LMd7UqgrJFLY1bm4bwUSfQC7Bzm7nueEh7PlQBaV8nFpEEXBjg83Al4oDAEhEYLLxvPdu9JdzS+aXL5KJIp7cPpeQZxwimGATLpoBOzkKfZQ66fbQqzTG8juw4YDAk8Ucdwbzd7IzqdA== X-Forefront-Antispam-Report-Untrusted: CIP:172.205.89.229; CTRY:IE; LANG:en; SCL:1; SRV:; IPV:NLI; SFV:NSPM; H:nebula.arm.com; PTR:InfoDomainNonexistent; CAT:NONE; SFS:(13230040)(376014)(82310400026)(1800799024)(36860700013)(13003099007); DIR:OUT; SFP:1101; X-MS-Exchange-Transport-CrossTenantHeadersStamped: PAVPR08MB9601 X-MS-Exchange-Transport-CrossTenantHeadersStripped: AMS0EPF000001A8.eurprd05.prod.outlook.com X-MS-PublicTrafficType: Email X-MS-Office365-Filtering-Correlation-Id-Prvs: 798b08e5-2ec7-4563-074e-08de7526f35c X-Microsoft-Antispam: BCL:0; ARA:13230040|82310400026|35042699022|376014|14060799003|1800799024|36860700013|13003099007; X-Microsoft-Antispam-Message-Info: SzgWr0kgQN2E/AMef+HDpmXN1AROz4+rTICuwn8eLxHJXh5SNXDkv7TSiZzcZOWcy4CFJRGXUlOEkdZ8OECb5oggU0RSEANeM+XhbICOH01oN7BxsoqyivSIfRRX+ghrIdSgzqQcnhlE5e6IcxJgINMkyAdOvloMyJ00oAJNA8ABvFUi4bmSMzMJJvyFv1sSXYo6C92S7zj9bPHWLJviOwtl/rXSfeYbOmuZZxCBUM6GitgGdMnIQLjX5b6GBGRnxr63O4EM7SPIj8qeZH1SF6fIygs6sRxQkB1HDPIcvMmFZtgr1oGqmwfx40Cp/ROw4fgEWyRM23vjBcINkloCB7P2PhhqIqD9EzPD1plozN9J4jvmuNbXpAaSvLGg8y1aErympj6uDJN9HYsyH69lYMV5Kbd0pOM26dqRsCU7cuxWD4zDPpTfTJ12qF8fHuyRy8cUAQsRxEPeTnDKfIWiLVbYsOf541wtHIq0tRsbCzRrh2JRChd6RY1U2pFUdqgNVXCcvgmrRtM1w+UMyVQAlxdOdL49OOSXzNH3XOTaOjoeRkiMmyDhDnw2Nn/OPhXp6nUx4zw9pi9XQGzH4LlR/KyvjZIbPx+FPHBXqk8KlRNc/tuQ3W5M/NFDqkZcScHfpaNFb/ONFTIvD50OVr3cdEwGon/rNGanZCN+U6iuSanlBR7k8QmzibYocVAFbf6jQC0L+N6S4ouaRXMpikwCvQz0SFjllLkEiuPV/GDEYgnKiLo7qyV+SM0ZP1TBwi98oC+IRZonu+3Lm6FZT3JueBWM+z5sqh+9JiaWtPv/L2BE2NvF3oP9v2JWQ9OeaZ6ahRJkSZ1F1LL7P5bu5BJj+A== X-Forefront-Antispam-Report: CIP:4.158.2.129; CTRY:GB; LANG:en; SCL:1; SRV:; IPV:NLI; SFV:NSPM; H:outbound-uk1.az.dlp.m.darktrace.com; PTR:InfoDomainNonexistent; CAT:NONE; SFS:(13230040)(82310400026)(35042699022)(376014)(14060799003)(1800799024)(36860700013)(13003099007); DIR:OUT; SFP:1101; X-MS-Exchange-AntiSpam-MessageData-ChunkCount: 1 X-MS-Exchange-AntiSpam-MessageData-0: IoSX8+oLnwML2Ofnf91fHwayGZaJJFz2oFq5CeFajhtSa274HLY+yA+DrRy+twY/XkJ6yFwoxqggDzB3qHAWkJArwBcMV6hrGWf2Mq0tLCv24IkGACPTP0Bzsgpd1vW3iu5AD8Izx6iQsDwl+neWCzCggPO8uVf+qAIgE6/mMKuD1m+QzdtqQI0lEgsyIeK76WoH5th4aEabNvB+tTY9D5nnql2+aJjKQjvxRxhe7YBlTMdLhF1zk1YQbRKO7RpVRbMR1weG/4xHidjHGD/2Uv6PGdtlLeu128P3kMHYXvrDMTZHiRnZjBJxTrIIsKv08jLA2CpYVkhcFrOmQ5z4lszE2XUssEvE9xMeXO7Fgp/FY+NXO/zvbepRUpI538bQldPOdtzFuuKi59EntoYWWEqdfDxAyAjWJoqrs7ksXWMiAbk5CbabR/000uDF4Jz+ X-OriginatorOrg: arm.com X-MS-Exchange-CrossTenant-OriginalArrivalTime: 26 Feb 2026 11:06:39.2423 (UTC) X-MS-Exchange-CrossTenant-Network-Message-Id: 5ee98ee4-3d04-4ed9-5f6e-08de75271d89 X-MS-Exchange-CrossTenant-Id: f34e5979-57d9-4aaa-ad4d-b122a662184d X-MS-Exchange-CrossTenant-OriginalAttributedTenantConnectingIp: TenantId=f34e5979-57d9-4aaa-ad4d-b122a662184d; Ip=[4.158.2.129]; Helo=[outbound-uk1.az.dlp.m.darktrace.com] X-MS-Exchange-CrossTenant-AuthSource: AMS0EPF000001A8.eurprd05.prod.outlook.com X-MS-Exchange-CrossTenant-AuthAs: Anonymous X-MS-Exchange-CrossTenant-FromEntityHeader: HybridOnPrem X-MS-Exchange-Transport-CrossTenantHeadersStamped: AS2PR08MB9414 X-Spam-Status: No, score=-12.1 required=5.0 tests=BAYES_00, DKIM_SIGNED, DKIM_VALID, DKIM_VALID_AU, DKIM_VALID_EF, FORGED_SPF_HELO, GIT_PATCH_0, KAM_SHORT, SPF_HELO_PASS, SPF_NONE, TXREP, URIBL_BLOCKED autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on sourceware.org X-BeenThere: libc-alpha@sourceware.org X-Mailman-Version: 2.1.30 Precedence: list List-Id: Libc-alpha mailing list List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: libc-alpha-bounces~patchwork=sourceware.org@sourceware.org This patch vectorises remaining special cases and optimises some fast path performance for single and double precision exp, SVE and AdvSIMD. Moves most special case functions to header files to minimise code size. Uses NOINLINE in main case where half width alias is used to minimise codegen. Special case vectorisation performance increase of average 8x to greatest 9.5x. Special case improvements performance increase average 15% speed improvement to greatest 40%. Some fast path gains during rework of files. Fastest notable increase in exp2m1 AdvSIMD double precision of 26% improvement. Most fast paths improved by 5-10%. 8 unchanged. No regressions. Benchmarked on Neoverse V2 with GCC@15 Reviewed-by: Adhemerval Zanella --- Ok for Master? if so please commit for me as I do not have rights. Thanks Richard sysdeps/aarch64/fpu/exp10_advsimd.c | 85 ++++---- sysdeps/aarch64/fpu/exp10_sve.c | 82 +++----- sysdeps/aarch64/fpu/exp10f_advsimd.c | 54 ++--- sysdeps/aarch64/fpu/exp10f_sve.c | 47 +++-- sysdeps/aarch64/fpu/exp10m1_advsimd.c | 53 +++-- sysdeps/aarch64/fpu/exp10m1_sve.c | 91 ++++----- sysdeps/aarch64/fpu/exp10m1f_advsimd.c | 65 +++--- sysdeps/aarch64/fpu/exp10m1f_sve.c | 75 +++---- sysdeps/aarch64/fpu/exp2_advsimd.c | 64 +++--- sysdeps/aarch64/fpu/exp2_sve.c | 98 ++++----- sysdeps/aarch64/fpu/exp2f_advsimd.c | 47 ++--- sysdeps/aarch64/fpu/exp2f_sve.c | 49 +++-- sysdeps/aarch64/fpu/exp2m1_advsimd.c | 51 +++-- sysdeps/aarch64/fpu/exp2m1_sve.c | 193 +++++++----------- sysdeps/aarch64/fpu/exp2m1f_advsimd.c | 69 +++---- sysdeps/aarch64/fpu/exp2m1f_sve.c | 52 ++--- sysdeps/aarch64/fpu/exp_advsimd.c | 110 +++++----- sysdeps/aarch64/fpu/exp_sve.c | 92 +++------ sysdeps/aarch64/fpu/expf_advsimd.c | 50 ++--- sysdeps/aarch64/fpu/expf_sve.c | 116 +++++++++-- sysdeps/aarch64/fpu/expm1_advsimd.c | 115 +++++++++-- sysdeps/aarch64/fpu/expm1_sve.c | 120 +++++------ sysdeps/aarch64/fpu/expm1f_advsimd.c | 83 ++++++-- sysdeps/aarch64/fpu/expm1f_sve.c | 83 ++++---- sysdeps/aarch64/fpu/sv_exp_special_inline.h | 67 ++++++ sysdeps/aarch64/fpu/sv_expf_special_inline.h | 72 +++++++ .../aarch64/fpu/v_exp_special_case_inline.h | 49 +++++ sysdeps/aarch64/fpu/v_expf_special_inline.h | 53 +++++ 28 files changed, 1211 insertions(+), 974 deletions(-) create mode 100644 sysdeps/aarch64/fpu/sv_exp_special_inline.h create mode 100644 sysdeps/aarch64/fpu/sv_expf_special_inline.h create mode 100644 sysdeps/aarch64/fpu/v_exp_special_case_inline.h create mode 100644 sysdeps/aarch64/fpu/v_expf_special_inline.h diff --git a/sysdeps/aarch64/fpu/exp10_advsimd.c b/sysdeps/aarch64/fpu/exp10_advsimd.c index 7b923c7c144..c02cc86ab2d 100644 --- a/sysdeps/aarch64/fpu/exp10_advsimd.c +++ b/sysdeps/aarch64/fpu/exp10_advsimd.c @@ -18,53 +18,53 @@ . */ #include "v_math.h" +#include "v_exp_special_case_inline.h" /* Value of |x| above which scale overflows without special treatment. */ -#define SpecialBound 306.0 /* floor (log10 (2^1023)) - 1. */ +#define SpecialBound 0x1.33a6fa2f05a71p+8 /* log10(2^1022) ~ 307.66. */ + /* Value of n above which scale overflows even with special treatment. */ #define ScaleBound 163840.0 /* 1280.0 * N. */ const static struct data { - float64x2_t poly[4]; - float64x2_t log10_2, log2_10_hi, log2_10_lo, shift; + struct v_exp_special_data special_data; + double c1, c3; + double log2_10_hi, log2_10_lo; + float64x2_t c0, c2, log10_2, shift; float64x2_t special_bound, scale_thresh; } data = { + .special_data = V_EXP_SPECIAL_DATA, /* Coefficients generated using Remez algorithm. rel error: 0x1.5ddf8f28p-54 - abs error: 0x1.5ed266c8p-54 in [ -log10(2)/256, log10(2)/256 ] - maxerr: 1.14432 +0.5 ulp. */ - .poly = { V2 (0x1.26bb1bbb5524p1), V2 (0x1.53524c73cecdap1), - V2 (0x1.047060efb781cp1), V2 (0x1.2bd76040f0d16p0) }, - .log10_2 = V2 (0x1.a934f0979a371p8), /* N/log2(10). */ - .log2_10_hi = V2 (0x1.34413509f79ffp-9), /* log2(10)/N. */ - .log2_10_lo = V2 (-0x1.9dc1da994fd21p-66), + abs error: 0x1.5ed266c8p-54 in [ -log10(2)/256, log10(2)/256 ]. */ + .c0 = V2 (0x1.26bb1bbb5524p1), + .c1 = 0x1.53524c73cecdap1, + .c2 = V2 (0x1.047060efb781cp1), + .c3 = 0x1.2bd76040f0d16p0, + .log10_2 = V2 (0x1.a934f0979a371p8), /* N/log2(10). */ + .log2_10_hi = 0x1.34413509f79ffp-9, /* log2(10)/N. */ + .log2_10_lo = -0x1.9dc1da994fd21p-66, .shift = V2 (0x1.8p+52), .scale_thresh = V2 (ScaleBound), + /* Implementation triggers special case handling as soon as the scale + overflows, which is earlier than exp10's overflow bound + `log10(DBL_MAX) ≈ 308.25` or underflow bound + `log10(DBL_TRUE_MIN) ≈ -323.31`. + The absolute comparison catches all these cases efficiently, although + there is a small window where special cases are triggered + unnecessarily. */ .special_bound = V2 (SpecialBound), }; #define N (1 << V_EXP_TABLE_BITS) -#define IndexMask v_u64 (N - 1) - -# define SpecialOffset v_u64 (0x6000000000000000) /* 0x1p513. */ -/* SpecialBias1 + SpecialBias1 = asuint(1.0). */ -# define SpecialBias1 v_u64 (0x7000000000000000) /* 0x1p769. */ -# define SpecialBias2 v_u64 (0x3010000000000000) /* 0x1p-254. */ +#define IndexMask (N - 1) -static inline float64x2_t VPCS_ATTR -special_case (float64x2_t s, float64x2_t y, float64x2_t n, - const struct data *d) +static inline uint64x2_t +lookup_sbits (uint64x2_t i) { - /* 2^(n/N) may overflow, break it up into s1*s2. */ - uint64x2_t b = vandq_u64 (vcltzq_f64 (n), SpecialOffset); - float64x2_t s1 = vreinterpretq_f64_u64 (vsubq_u64 (SpecialBias1, b)); - float64x2_t s2 = vreinterpretq_f64_u64 ( - vaddq_u64 (vsubq_u64 (vreinterpretq_u64_f64 (s), SpecialBias2), b)); - uint64x2_t cmp = vcagtq_f64 (n, d->scale_thresh); - float64x2_t r1 = vmulq_f64 (s1, s1); - float64x2_t r0 = vmulq_f64 (vfmaq_f64 (s2, y, s2), s1); - return vbslq_f64 (cmp, r1, r0); + return (uint64x2_t){ __v_exp_data[i[0] & IndexMask], + __v_exp_data[i[1] & IndexMask] }; } /* Fast vector implementation of exp10. @@ -74,7 +74,7 @@ special_case (float64x2_t s, float64x2_t y, float64x2_t n, float64x2_t VPCS_ATTR V_NAME_D1 (exp10) (float64x2_t x) { const struct data *d = ptr_barrier (&data); - uint64x2_t cmp = vcageq_f64 (x, d->special_bound); + uint64x2_t cmp = vcagtq_f64 (x, d->special_bound); /* n = round(x/(log10(2)/N)). */ float64x2_t z = vfmaq_f64 (d->shift, x, d->log10_2); @@ -82,26 +82,27 @@ float64x2_t VPCS_ATTR V_NAME_D1 (exp10) (float64x2_t x) float64x2_t n = vsubq_f64 (z, d->shift); /* r = x - n*log10(2)/N. */ + float64x2_t log2_10_hl = vld1q_f64 (&d->log2_10_hi); float64x2_t r = x; - r = vfmsq_f64 (r, d->log2_10_hi, n); - r = vfmsq_f64 (r, d->log2_10_lo, n); + r = vfmsq_laneq_f64 (r, n, log2_10_hl, 0); + r = vfmsq_laneq_f64 (r, n, log2_10_hl, 1); uint64x2_t e = vshlq_n_u64 (u, 52 - V_EXP_TABLE_BITS); - uint64x2_t i = vandq_u64 (u, IndexMask); - /* y = exp10(r) - 1 ~= C0 r + C1 r^2 + C2 r^3 + C3 r^4. */ + /* poly = exp10(r) - 1 ~= C0 r + C1 r^2 + C2 r^3 + C3 r^4. */ + float64x2_t c13 = vld1q_f64 (&d->c1); + float64x2_t p = vfmaq_laneq_f64 (d->c0, r, c13, 0); + float64x2_t poly = vfmaq_laneq_f64 (d->c2, r, c13, 1); float64x2_t r2 = vmulq_f64 (r, r); - float64x2_t p = vfmaq_f64 (d->poly[0], r, d->poly[1]); - float64x2_t y = vfmaq_f64 (d->poly[2], r, d->poly[3]); - p = vfmaq_f64 (p, y, r2); - y = vmulq_f64 (r, p); + p = vfmaq_f64 (p, poly, r2); + poly = vmulq_f64 (r, p); - /* s = 2^(n/N). */ - u = v_lookup_u64 (__v_exp_data, i); - float64x2_t s = vreinterpretq_f64_u64 (vaddq_u64 (u, e)); + /* scale = 2^(n/N). */ + u = lookup_sbits (u); + float64x2_t scale = vreinterpretq_f64_u64 (vaddq_u64 (u, e)); if (__glibc_unlikely (v_any_u64 (cmp))) - return special_case (s, y, n, d); + return exp_special (poly, n, scale, d->scale_thresh, &d->special_data); - return vfmaq_f64 (s, y, s); + return vfmaq_f64 (scale, poly, scale); } diff --git a/sysdeps/aarch64/fpu/exp10_sve.c b/sysdeps/aarch64/fpu/exp10_sve.c index 8c95a925bc1..fd04bab44a0 100644 --- a/sysdeps/aarch64/fpu/exp10_sve.c +++ b/sysdeps/aarch64/fpu/exp10_sve.c @@ -18,14 +18,21 @@ . */ #include "sv_math.h" +#include "sv_exp_special_inline.h" -#define SpecialBound 307.0 /* floor (log10 (2^1023)). */ +/* Value of |x| above which scale overflows without special treatment. + log10(2^1022 + 1/128) ~ 307.65. */ +#define SpecialBound 0x1.33a7ae900b507p+8 static const struct data { - double c1, c3, c2, c4, c0; - double shift, log10_2, log2_10_hi, log2_10_lo, scale_thres, special_bound; + struct sv_exp_special_data special_data; + double log2_10_hi, log2_10_lo; + double log10_2, c0; + double c1, c3, c2, c4; + double shift, special_bound; } data = { + .special_data = SV_EXP_SPECIAL_DATA, /* Coefficients generated using Remez algorithm. rel error: 0x1.9fcb9b3p-60 abs error: 0x1.a20d9598p-60 in [ -log10(2)/128, log10(2)/128 ] @@ -40,43 +47,22 @@ static const struct data .log10_2 = 0x1.a934f0979a371p1, /* 1/log2(10). */ .log2_10_hi = 0x1.34413509f79ffp-2, /* log2(10). */ .log2_10_lo = -0x1.9dc1da994fd21p-59, - .scale_thres = 1280.0, .special_bound = SpecialBound, }; -#define SpecialOffset 0x6000000000000000 /* 0x1p513. */ -/* SpecialBias1 + SpecialBias1 = asuint(1.0). */ -#define SpecialBias1 0x7000000000000000 /* 0x1p769. */ -#define SpecialBias2 0x3010000000000000 /* 0x1p-254. */ - -/* Update of both special and non-special cases, if any special case is - detected. */ -static inline svfloat64_t -special_case (svbool_t pg, svfloat64_t s, svfloat64_t y, svfloat64_t n, - const struct data *d) +static svfloat64_t NOINLINE +special_exp (svfloat64_t scale, svfloat64_t poly, svfloat64_t n, svuint64_t u, + const struct sv_exp_special_data *ds) { - /* s=2^n may overflow, break it up into s=s1*s2, - such that exp = s + s*y can be computed as s1*(s2+s2*y) - and s1*s1 overflows only if n>0. */ - - /* If n<=0 then set b to 0x6, 0 otherwise. */ - svbool_t p_sign = svcmple (pg, n, 0.0); /* n <= 0. */ - svuint64_t b = svdup_u64_z (p_sign, SpecialOffset); - - /* Set s1 to generate overflow depending on sign of exponent n. */ - svfloat64_t s1 = svreinterpret_f64 (svsubr_x (pg, b, SpecialBias1)); - /* Offset s to avoid overflow in final result if n is below threshold. */ - svfloat64_t s2 = svreinterpret_f64 ( - svadd_x (pg, svsub_x (pg, svreinterpret_u64 (s), SpecialBias2), b)); - - /* |n| > 1280 => 2^(n) overflows. */ - svbool_t p_cmp = svacgt (pg, n, d->scale_thres); - - svfloat64_t r1 = svmul_x (svptrue_b64 (), s1, s1); - svfloat64_t r2 = svmla_x (pg, s2, s2, y); - svfloat64_t r0 = svmul_x (svptrue_b64 (), r2, s1); - - return svsel (p_cmp, r1, r0); + /* FEXPA zeroes the sign bit, however the sign is meaningful to the + special case function so needs to be copied. + e = sign bit of u << 46. */ + svuint64_t e = svand_x (svptrue_b64 (), svlsl_x (svptrue_b64 (), u, 46), + 0x8000000000000000); + /* Copy sign to scale. */ + scale = svreinterpret_f64 ( + svadd_x (svptrue_b64 (), e, svreinterpret_u64 (scale))); + return special_case (scale, poly, n, ds); } /* Fast vector implementation of exp10 using FEXPA instruction. @@ -86,12 +72,11 @@ special_case (svbool_t pg, svfloat64_t s, svfloat64_t y, svfloat64_t n, svfloat64_t SV_NAME_D1 (exp10) (svfloat64_t x, svbool_t pg) { const struct data *d = ptr_barrier (&data); - svbool_t no_big_scale = svacle (pg, x, d->special_bound); - svbool_t special = svnot_z (pg, no_big_scale); /* n = round(x/(log10(2)/N)). */ svfloat64_t shift = sv_f64 (d->shift); - svfloat64_t z = svmla_x (pg, shift, x, d->log10_2); + svfloat64_t log10_2_c0 = svld1rq (svptrue_b64 (), &d->log10_2); + svfloat64_t z = svmla_lane (shift, x, log10_2_c0, 0); svfloat64_t n = svsub_x (pg, z, shift); /* r = x - n*log10(2)/N. */ @@ -112,21 +97,14 @@ svfloat64_t SV_NAME_D1 (exp10) (svfloat64_t x, svbool_t pg) svfloat64_t p34 = svmla_lane (sv_f64 (d->c3), r, c24, 1); svfloat64_t p14 = svmla_x (pg, p12, p34, r2); - svfloat64_t y = svmla_x (pg, svmul_x (svptrue_b64 (), r, d->c0), r2, p14); + svfloat64_t poly = svmla_x (pg, svmul_lane (r, log10_2_c0, 1), r2, p14); - /* Assemble result as exp10(x) = 2^n * exp10(r). If |x| > SpecialBound + svbool_t special = svacge (svptrue_b64 (), x, d->special_bound); + /* Assemble result as exp(x) = 2^n * exp(r). If |x| > Thresh the multiplication may overflow, so use special case routine. */ - if (__glibc_unlikely (svptest_any (pg, special))) - { - /* FEXPA zeroes the sign bit, however the sign is meaningful to the - special case function so needs to be copied. - e = sign bit of u << 46. */ - svuint64_t e = svand_x (pg, svlsl_x (pg, u, 46), 0x8000000000000000); - /* Copy sign to scale. */ - scale = svreinterpret_f64 (svadd_x (pg, e, svreinterpret_u64 (scale))); - return special_case (pg, scale, y, n, d); - } + if (__glibc_unlikely (svptest_any (special, special))) + return special_exp (scale, poly, n, u, &d->special_data); /* No special case. */ - return svmla_x (pg, scale, scale, y); + return svmla_x (pg, scale, scale, poly); } diff --git a/sysdeps/aarch64/fpu/exp10f_advsimd.c b/sysdeps/aarch64/fpu/exp10f_advsimd.c index b6d21b1ead6..65230948723 100644 --- a/sysdeps/aarch64/fpu/exp10f_advsimd.c +++ b/sysdeps/aarch64/fpu/exp10f_advsimd.c @@ -18,22 +18,24 @@ . */ #include "v_math.h" +#include "v_expf_special_inline.h" -#define ScaleBound 192.0f +/* Value of |x| above which scale overflows without special treatment. */ +#define SpecialBound 0x1.30a46f1561911p+5 /* ≈ 38.09. */ static const struct data { - float32x4_t c0, c1, c3; + struct v_expf_special_data special_data; float log10_2_high, log10_2_low, c2, c4; + float32x4_t c0, c1, c3; float32x4_t inv_log10_2, special_bound; - uint32x4_t exponent_bias, special_offset, special_bias; - float32x4_t scale_thresh; + uint32x4_t exponent_bias; } data = { + .special_data = V_EXPF_SPECIAL_DATA, /* Coefficients generated using Remez algorithm with minimisation of relative error. rel error: 0x1.89dafa3p-24 - abs error: 0x1.167d55p-23 in [-log10(2)/2, log10(2)/2] - maxerr: 1.85943 +0.5 ulp. */ + abs error: 0x1.167d55p-23 in [-log10(2)/2, log10(2)/2]. */ .c0 = V4 (0x1.26bb16p+1f), .c1 = V4 (0x1.5350d2p+1f), .c2 = 0x1.04744ap+1f, @@ -42,35 +44,19 @@ static const struct data .inv_log10_2 = V4 (0x1.a934fp+1), .log10_2_high = 0x1.344136p-2, .log10_2_low = 0x1.ec10cp-27, - /* rint (log2 (2^127 / (1 + sqrt (2)))). */ - .special_bound = V4 (126.0f), + .special_bound = V4 (SpecialBound), + /* Implementation triggers special case handling as soon as the scale + overflows, which is earlier than exp10f's + true overflow bound `log10(FLT_MAX) ≈ 38.53` or + underflow bound `log10(FLT_TRUE_MIN) ≈ -44.85`. + The absolute comparison catches all these cases efficiently, although + there is a small window where special cases are triggered + unnecessarily. */ .exponent_bias = V4 (0x3f800000), - .special_offset = V4 (0x82000000), - .special_bias = V4 (0x7f000000), - .scale_thresh = V4 (ScaleBound) }; -# define SpecialBound 126.0f - -static float32x4_t VPCS_ATTR NOINLINE -special_case (float32x4_t poly, float32x4_t n, uint32x4_t e, uint32x4_t cmp1, - float32x4_t scale, const struct data *d) -{ - /* 2^n may overflow, break it up into s1*s2. */ - uint32x4_t b = vandq_u32 (vclezq_f32 (n), d->special_offset); - float32x4_t s1 = vreinterpretq_f32_u32 (vaddq_u32 (b, d->special_bias)); - float32x4_t s2 = vreinterpretq_f32_u32 (vsubq_u32 (e, b)); - uint32x4_t cmp2 = vcagtq_f32 (n, d->scale_thresh); - float32x4_t r2 = vmulq_f32 (s1, s1); - float32x4_t r1 = vmulq_f32 (vfmaq_f32 (s2, poly, s2), s1); - /* Similar to r1 but avoids double rounding in the subnormal range. */ - float32x4_t r0 = vfmaq_f32 (scale, poly, scale); - float32x4_t r = vbslq_f32 (cmp1, r1, r0); - return vbslq_f32 (cmp2, r2, r); -} - -/* Fast vector implementation of single-precision exp10. - Algorithm is accurate to 2.36 ULP. +/* Single-precision vector exp10f routine. + The maximum error is 1.86 +0.5 ULP: _ZGVnN4v_exp10f(0x1.be2b36p+1) got 0x1.7e79c4p+11 want 0x1.7e79cp+11. */ float32x4_t VPCS_ATTR NOINLINE V_NAME_F1 (exp10) (float32x4_t x) @@ -88,7 +74,7 @@ float32x4_t VPCS_ATTR NOINLINE V_NAME_F1 (exp10) (float32x4_t x) float32x4_t scale = vreinterpretq_f32_u32 (vaddq_u32 (e, d->exponent_bias)); - uint32x4_t cmp = vcagtq_f32 (n, d->special_bound); + uint32x4_t cmp = vcageq_f32 (x, d->special_bound); float32x4_t r2 = vmulq_f32 (r, r); float32x4_t p12 = vfmaq_laneq_f32 (d->c1, r, log10_2_c24, 2); @@ -97,7 +83,7 @@ float32x4_t VPCS_ATTR NOINLINE V_NAME_F1 (exp10) (float32x4_t x) float32x4_t poly = vfmaq_f32 (vmulq_f32 (r, d->c0), p14, r2); if (__glibc_unlikely (v_any_u32 (cmp))) - return special_case (poly, n, e, cmp, scale, d); + return expf_special (poly, n, e, cmp, scale, &d->special_data); return vfmaq_f32 (scale, poly, scale); } libmvec_hidden_def (V_NAME_F1 (exp10)) diff --git a/sysdeps/aarch64/fpu/exp10f_sve.c b/sysdeps/aarch64/fpu/exp10f_sve.c index 4911a166e39..0c426797ff0 100644 --- a/sysdeps/aarch64/fpu/exp10f_sve.c +++ b/sysdeps/aarch64/fpu/exp10f_sve.c @@ -19,14 +19,18 @@ #include "sv_math.h" -/* For x < -Thres (-log10(2^126)), the result is subnormal and not handled +/* For x < -SpecialBound, the result is subnormal and not handled correctly by FEXPA. */ -#define Thres 0x1.2f702p+5 +#define SpecialBound 0x1.2f702p+5 /* log10(2^126) ~ 37.93. */ + +/* Values of x over which exp overflows or underflows. */ +#define InfBound 0x1.34ccccccccccdp+5 /* 38.6. */ +#define ZeroBound -0x1.68ccccccccccdp+5 /* -45.1. */ static const struct data { float log10_2, log2_10_hi, log2_10_lo, c1; - float c0, shift, thres; + float c0, shift, special_bound, inf_bound, zero_bound; } data = { /* Coefficients generated using Remez algorithm with minimisation of relative error. */ @@ -37,7 +41,9 @@ static const struct data .log10_2 = 0x1.a934fp+1, .log2_10_hi = 0x1.344136p-2, .log2_10_lo = -0x1.ec10cp-27, - .thres = Thres, + .special_bound = SpecialBound, + .inf_bound = InfBound, + .zero_bound = ZeroBound, }; static inline svfloat32_t @@ -63,25 +69,40 @@ sv_exp10f_inline (svfloat32_t x, const svbool_t pg, const struct data *d) /* Polynomial evaluation: poly(r) ~ exp10(r)-1. */ svfloat32_t poly = svmla_lane (sv_f32 (d->c0), r, lane_consts, 3); poly = svmul_x (pg, poly, r); + return svmla_x (pg, scale, scale, poly); } static svfloat32_t NOINLINE -special_case (svfloat32_t x, svbool_t special, const struct data *d) +special_case (svfloat32_t x, svbool_t pg, svbool_t special, + const struct data *d) { - return sv_call_f32 (exp10f, x, sv_exp10f_inline (x, svptrue_b32 (), d), - special); + /* This deals with overflow and underflow in exponential for special case + lanes. */ + svbool_t is_inf = svcmpgt (pg, x, d->inf_bound); + svbool_t is_zero = svcmplt (pg, x, d->zero_bound); + + /* The input `x` is further reduced (to `x/2`) to allow for accurate + approximation on the interval `x > SpecialBound ~ 0x1.2f702p+5`. */ + x = svmul_m (special, x, 0.5); + + /* Computes exp(x/2), and set lanes with underflow/overflow. */ + svfloat32_t half_exp = sv_exp10f_inline (x, svptrue_b32 (), d); + half_exp = svmul_m (special, half_exp, half_exp); + half_exp = svsel (is_inf, sv_f32 (INFINITY), half_exp); + + return svsel (is_zero, sv_f32 (0), half_exp); } /* Single-precision SVE exp10f routine. Based on the FEXPA instruction. - Worst case error is 1.10 ULP. - _ZGVsMxv_exp10f (0x1.cc76dep+3) got 0x1.be0172p+47 - want 0x1.be017p+47. */ + Worst case error is 2.86 ULP +0.50 ULP. + _ZGVsMxv_exp10f (0x1.31b778p+5) got 0x1.ed399p+126 + want 0x1.ed398ap+126. */ svfloat32_t SV_NAME_F1 (exp10) (svfloat32_t x, const svbool_t pg) { const struct data *d = ptr_barrier (&data); - svbool_t special = svacgt (pg, x, d->thres); + svbool_t special = svacgt (pg, x, d->special_bound); if (__glibc_unlikely (svptest_any (special, special))) - return special_case (x, special, d); - return sv_exp10f_inline (x, pg, d); + return special_case (x, pg, special, d); + return sv_exp10f_inline (x, svptrue_b32 (), d); } diff --git a/sysdeps/aarch64/fpu/exp10m1_advsimd.c b/sysdeps/aarch64/fpu/exp10m1_advsimd.c index a1133094883..e2897a75787 100644 --- a/sysdeps/aarch64/fpu/exp10m1_advsimd.c +++ b/sysdeps/aarch64/fpu/exp10m1_advsimd.c @@ -19,9 +19,10 @@ #include "v_math.h" +#include "v_exp_special_case_inline.h" /* Value of |x| above which scale overflows without special treatment. */ -#define SpecialBound 306.0 /* floor (log10 (2^1023)) - 1. */ +#define SpecialBound 0x1.33f424c404a73p+8 /* log10(2^1023) ~ 307.96. */ /* Value of n above which scale overflows even with special treatment. */ #define ScaleBound 163840.0 /* 1280.0 * N. */ @@ -29,11 +30,9 @@ /* Value of |x| below which scale - 1 contributes produces large error. */ #define TableBound 0x1.a308a4198c9d7p-4 /* log10(2) * 87/256. */ -#define N (1 << V_EXP_TABLE_BITS) -#define IndexMask (N - 1) - const static struct data { + struct v_exp_special_data special_data; double c6, c0; double c2, c4; double log2_10_hi, log2_10_lo; @@ -44,6 +43,9 @@ const static struct data float64x2_t rnd2zero; uint64_t scalem1[88]; } data = { + /* Special case data from helper. */ + .special_data = V_EXP_SPECIAL_DATA, + /* Coefficients generated using Remez algorithm. */ .c0 = 0x1.26bb1bbb55516p1, .c1 = V2 (0x1.53524c73cea69p1), @@ -57,12 +59,18 @@ const static struct data .rnd2zero = V2 (-0x1.34413509f79ffp-10), /* (2^-8)/log2(10). */ .sm1_tbl_off = V2 (24), .sm1_tbl_mask = V2 (0x3f), - .log10_2 = V2 (0x1.a934f0979a371p8), /* N/log2(10). */ .log2_10_hi = 0x1.34413509f79ffp-9, /* log2(10)/N. */ .log2_10_lo = -0x1.9dc1da994fd21p-66, .shift = V2 (0x1.8p+52), .scale_thresh = V2 (ScaleBound), + /* Implementation triggers special case handling as soon as the scale + overflows, which is earlier than exp10m1's overflow bound + `log10(DBL_MAX) ≈ 308.25` or underflow bound + `log10(DBL_TRUE_MIN) ≈ -323.31`. + The absolute comparison catches all these cases efficiently, although + there is a small window where special cases are triggered + unnecessarily. */ .special_bound = V2 (SpecialBound), /* Table containing 2^x - 1, for 2^x values close to 1. @@ -104,25 +112,8 @@ const static struct data } }; -#define SpecialOffset v_u64 (0x6000000000000000) /* 0x1p513. */ -/* SpecialBias1 + SpecialBias1 = asuint(1.0). */ -#define SpecialBias1 v_u64 (0x7000000000000000) /* 0x1p769. */ -#define SpecialBias2 v_u64 (0x3010000000000000) /* 0x1p-254. */ - -static inline float64x2_t VPCS_ATTR -special_case (float64x2_t s, float64x2_t y, float64x2_t n, - const struct data *d) -{ - /* 2^(n/N) may overflow, break it up into s1*s2. */ - uint64x2_t b = vandq_u64 (vcltzq_f64 (n), SpecialOffset); - float64x2_t s1 = vreinterpretq_f64_u64 (vsubq_u64 (SpecialBias1, b)); - float64x2_t s2 = vreinterpretq_f64_u64 ( - vaddq_u64 (vsubq_u64 (vreinterpretq_u64_f64 (s), SpecialBias2), b)); - uint64x2_t cmp = vcagtq_f64 (n, d->scale_thresh); - float64x2_t r1 = vmulq_f64 (s1, s1); - float64x2_t r0 = vmulq_f64 (vfmaq_f64 (s2, y, s2), s1); - return vsubq_f64 (vbslq_f64 (cmp, r1, r0), v_f64 (1.0)); -} +#define N (1 << V_EXP_TABLE_BITS) +#define IndexMask (N - 1) static inline uint64x2_t lookup_sbits (uint64x2_t i) @@ -151,7 +142,7 @@ lookup_sm1bits (float64x2_t x, uint64x2_t u, const struct data *d) float64x2_t VPCS_ATTR V_NAME_D1 (exp10m1) (float64x2_t x) { const struct data *d = ptr_barrier (&data); - uint64x2_t cmp = vcageq_f64 (x, d->special_bound); + uint64x2_t cmp = vcagtq_f64 (x, d->special_bound); /* n = round(x/(log10(2)/N)). */ float64x2_t z = vfmaq_f64 (d->shift, x, d->log10_2); @@ -177,7 +168,7 @@ float64x2_t VPCS_ATTR V_NAME_D1 (exp10m1) (float64x2_t x) float64x2_t p16 = vfmaq_f64 (p12, r2, p36); float64x2_t p0 = vmulq_laneq_f64 (r, c60, 1); - float64x2_t p = vfmaq_f64 (p0, r2, p16); + float64x2_t poly = vfmaq_f64 (p0, r2, p16); uint64x2_t e = vshlq_n_u64 (u, 52 - V_EXP_TABLE_BITS); @@ -192,11 +183,15 @@ float64x2_t VPCS_ATTR V_NAME_D1 (exp10m1) (float64x2_t x) scalem1 = vbslq_f64 (is_small, lookup_sm1bits (x, u, d), scalem1); /* Construct exp10m1 = (scale - 1) + scale * poly. */ - float64x2_t y = vfmaq_f64 (scalem1, scale, p); + float64x2_t y = vfmaq_f64 (scalem1, scale, poly); /* Fallback to special case for lanes with overflow. */ if (__glibc_unlikely (v_any_u64 (cmp))) - return vbslq_f64 (cmp, special_case (scale, p, n, d), y); - + { + float64x2_t special + = (exp_special (poly, n, scale, d->scale_thresh, &d->special_data)); + float64x2_t specialm1 = vsubq_f64 (special, v_f64 (1.0)); + return vbslq_f64 (cmp, specialm1, y); + } return y; } diff --git a/sysdeps/aarch64/fpu/exp10m1_sve.c b/sysdeps/aarch64/fpu/exp10m1_sve.c index d1cdb0cdff5..e93883c2cf4 100644 --- a/sysdeps/aarch64/fpu/exp10m1_sve.c +++ b/sysdeps/aarch64/fpu/exp10m1_sve.c @@ -18,24 +18,26 @@ . */ #include "sv_math.h" +#include "sv_exp_special_inline.h" -/* Value of |x| above which scale overflows without special treatment. */ -#define SpecialBound 0x1.33f4bedd4fa70p+8 /* log10(2^(1023 + 1/128)). */ - -/* Value of n above which scale overflows even with special treatment. */ -#define ScaleBound 1280.0 +/* Value of |x| above which scale overflows without special treatment. + log10(2^1023 + 1/128) ~ 307.96. */ +#define SpecialBound 0x1.33f4bedd4f3fdp+8 /* Value of |x| below which scale - 1 contributes produces large error. */ #define FexpaBound 0x1.2a9f2b61a7e2p-4 /* 31*(log10(2)/128). */ static const struct data { + struct sv_exp_special_data special_data; double log2_10_hi, log2_10_lo; + double log10_2, c1; double c3, c5; - double c0, c1, c2, c4; - double shift, log10_2, special_bound; + double c0, c2, c4; + double shift, special_bound; uint64_t scalem1[32]; } data = { + .special_data = SV_EXP_SPECIAL_DATA, /* Coefficients generated using Remez algorithm. */ .c0 = 0x1.26bb1bbb55516p1, .c1 = 0x1.53524c73cea6ap1, @@ -70,41 +72,22 @@ static const struct data }, }; -#define SpecialOffset 0x6000000000000000 /* 0x1p513. */ -/* SpecialBias1 + SpecialBias1 = asuint(1.0). */ -#define SpecialBias1 0x7000000000000000 /* 0x1p769. */ -#define SpecialBias2 0x3010000000000000 /* 0x1p-254. */ - -static NOINLINE svfloat64_t -special_case (svbool_t pg, svfloat64_t y, svfloat64_t s, svfloat64_t p, - svfloat64_t n) +static svfloat64_t NOINLINE +special_m1 (svbool_t special, svfloat64_t y, svfloat64_t z, svfloat64_t scale, + svfloat64_t poly, svfloat64_t n, + const struct sv_exp_special_data *ds) { - /* s=2^n may overflow, break it up into s=s1*s2, - such that exp = s + s*y can be computed as s1*(s2+s2*y) - and s1*s1 overflows only if n>0. */ - - /* If n<=0 then set b to 0x6, 0 otherwise. */ - svbool_t p_sign = svcmple (pg, n, 0.0); /* n <= 0. */ - svuint64_t b - = svdup_u64_z (p_sign, SpecialOffset); /* Inactive lanes set to 0. */ - - /* Set s1 to generate overflow depending on sign of exponent n, - ie. s1 = 0x70...0 - b. */ - svfloat64_t s1 = svreinterpret_f64 (svsubr_x (pg, b, SpecialBias1)); - /* Offset s to avoid overflow in final result if n is below threshold. - ie. s2 = as_u64 (s) - 0x3010...0 + b. */ - svfloat64_t s2 = svreinterpret_f64 ( - svadd_x (pg, svsub_x (pg, svreinterpret_u64 (s), SpecialBias2), b)); - - /* |n| > 1280 => 2^(n) overflows. */ - svbool_t p_cmp = svacgt (pg, n, ScaleBound); - - svfloat64_t r1 = svmul_x (svptrue_b64 (), s1, s1); - svfloat64_t r2 = svmla_x (pg, s2, s2, p); - svfloat64_t r0 = svmul_x (svptrue_b64 (), r2, s1); - - svbool_t is_safe = svacle (pg, n, 1023); /* Only correct special lanes. */ - return svsel (is_safe, y, svsub_x (pg, svsel (p_cmp, r1, r0), 1.0)); + /* FEXPA zeroes the sign bit, however the sign is meaningful to the + special case function so needs to be copied. + e = sign bit of u << 46. */ + svuint64_t u = svreinterpret_u64 (z); + svuint64_t e = svand_x (svptrue_b64 (), svlsl_x (svptrue_b64 (), u, 46), + 0x8000000000000000); + /* Copy sign to scale. */ + scale = svreinterpret_f64 ( + svadd_x (svptrue_b64 (), e, svreinterpret_u64 (scale))); + svfloat64_t special_result = special_case (scale, poly, n, ds); + return svsel (special, svsub_x (svptrue_b64 (), special_result, 1.0), y); } /* FEXPA based SVE exp10m1 algorithm. @@ -114,11 +97,11 @@ special_case (svbool_t pg, svfloat64_t y, svfloat64_t s, svfloat64_t p, svfloat64_t SV_NAME_D1 (exp10m1) (svfloat64_t x, svbool_t pg) { const struct data *d = ptr_barrier (&data); - svbool_t special = svacgt (pg, x, d->special_bound); /* n = round(x/(log10(2)/N)). */ svfloat64_t shift = sv_f64 (d->shift); - svfloat64_t z = svmla_x (pg, shift, x, d->log10_2); + svfloat64_t log10_2_c1 = svld1rq (svptrue_b64 (), &d->log10_2); + svfloat64_t z = svmla_lane (shift, x, log10_2_c1, 0); svfloat64_t n = svsub_x (pg, z, shift); /* r = x - n*log10(2)/N. */ @@ -135,13 +118,13 @@ svfloat64_t SV_NAME_D1 (exp10m1) (svfloat64_t x, svbool_t pg) svfloat64_t c24 = svld1rq (svptrue_b64 (), &d->c3); /* Approximate exp10(r) using polynomial. */ svfloat64_t r2 = svmul_x (svptrue_b64 (), r, r); - svfloat64_t p01 = svmla_x (pg, sv_f64 (d->c0), r, sv_f64 (d->c1)); + svfloat64_t p01 = svmla_lane (sv_f64 (d->c0), r, log10_2_c1, 1); svfloat64_t p23 = svmla_lane (sv_f64 (d->c2), r, c24, 0); svfloat64_t p45 = svmla_lane (sv_f64 (d->c4), r, c24, 1); svfloat64_t p25 = svmla_x (pg, p23, p45, r2); svfloat64_t p05 = svmla_x (pg, p01, p25, r2); - svfloat64_t p = svmul_x (pg, p05, r); + svfloat64_t poly = svmul_x (pg, p05, r); svfloat64_t scalem1 = svsub_x (pg, scale, 1.0); @@ -166,19 +149,15 @@ svfloat64_t SV_NAME_D1 (exp10m1) (svfloat64_t x, svbool_t pg) scalem1 = svsel (is_small, lookup, scalem1); } - svfloat64_t y = svmla_x (pg, scalem1, scale, p); - + /* Fallback to special case for lanes with overflow. */ + svbool_t special = svacgt (svptrue_b64 (), x, d->special_bound); /* FEXPA returns nan for large inputs so we special case those. */ - if (__glibc_unlikely (svptest_any (pg, special))) + if (__glibc_unlikely (svptest_any (special, special))) { - /* FEXPA zeroes the sign bit, however the sign is meaningful to the - special case function so needs to be copied. - e = sign bit of u << 46. */ - svuint64_t e = svand_x (pg, svlsl_x (pg, u, 46), 0x8000000000000000); - /* Copy sign to scale. */ - scale = svreinterpret_f64 (svadd_x (pg, e, svreinterpret_u64 (scale))); - return special_case (pg, y, scale, p, n); + svfloat64_t y = svmla_x (svptrue_b64 (), scalem1, scale, poly); + return special_m1 (special, y, z, scale, poly, n, &d->special_data); } - return y; + /* return expm1 = (scale - 1) + (scale * poly). */ + return svmla_x (pg, scalem1, scale, poly); } diff --git a/sysdeps/aarch64/fpu/exp10m1f_advsimd.c b/sysdeps/aarch64/fpu/exp10m1f_advsimd.c index 1471fdb5165..60f112114a0 100644 --- a/sysdeps/aarch64/fpu/exp10m1f_advsimd.c +++ b/sysdeps/aarch64/fpu/exp10m1f_advsimd.c @@ -18,24 +18,26 @@ . */ #include "v_math.h" +#include "v_expf_special_inline.h" /* Value of |x| above which scale overflows without special treatment. */ -#define SpecialBound 126.0f /* rint (log2 (2^127 / (1 + sqrt (2)))). */ - -/* Value of n above which scale overflows even with special treatment. */ -#define ScaleBound 192.0f +#define SpecialBound 0x1.330cf3ce9955ap+5 /* ≈ 38.3813. */ static const struct data { - float log10_2_high, log10_2_low; + struct v_expf_special_data special_data; + uint32x4_t exponent_bias; float log10_lo, c2, c4, c6; + float log10_2_high, log10_2_low; float32x4_t log10_hi, c1, c3, c5, c7, c8; float32x4_t inv_log10_2, special_bound; - uint32x4_t exponent_bias, special_offset, special_bias; - float32x4_t scale_thresh; } data = { + .special_data = V_EXPF_SPECIAL_DATA, /* Coefficients generated using Remez algorithm with minimisation of relative - error. */ + error. */ + .inv_log10_2 = V4 (0x1.a934fp+1), + .log10_2_high = 0x1.344136p-2, + .log10_2_low = 0x1.ec10cp-27, .log10_hi = V4 (0x1.26bb1b8000000p+1), .log10_lo = 0x1.daaa8b0000000p-26, .c1 = V4 (0x1.53524ep1), @@ -46,33 +48,17 @@ static const struct data .c6 = -0x1.05e38ep-4, .c7 = V4 (-0x1.c79f4ap-4), .c8 = V4 (0x1.2d6f34p1), - .inv_log10_2 = V4 (0x1.a934fp+1), - .log10_2_high = 0x1.344136p-2, - .log10_2_low = 0x1.ec10cp-27, .exponent_bias = V4 (0x3f800000), - .special_offset = V4 (0x82000000), - .special_bias = V4 (0x7f000000), - .scale_thresh = V4 (ScaleBound), + /* Implementation triggers special case handling as soon as the scale + overflows, which is earlier than exp10m1f's + true overflow bound `log10(FLT_MAX) ≈ 38.53` or + underflow bound `log10(FLT_TRUE_MIN) ≈ -44.85`. + The absolute comparison catches all these cases efficiently, although + there is a small window where special cases are triggered + unnecessarily. */ .special_bound = V4 (SpecialBound), }; -static float32x4_t VPCS_ATTR NOINLINE -special_case (float32x4_t poly, float32x4_t n, uint32x4_t e, uint32x4_t cmp1, - float32x4_t scale, const struct data *d) -{ - /* 2^n may overflow, break it up into s1*s2. */ - uint32x4_t b = vandq_u32 (vclezq_f32 (n), d->special_offset); - float32x4_t s1 = vreinterpretq_f32_u32 (vaddq_u32 (b, d->special_bias)); - float32x4_t s2 = vreinterpretq_f32_u32 (vsubq_u32 (e, b)); - uint32x4_t cmp2 = vcagtq_f32 (n, d->scale_thresh); - float32x4_t r2 = vmulq_f32 (s1, s1); - float32x4_t r1 = vmulq_f32 (vfmaq_f32 (s2, poly, s2), s1); - /* Similar to r1 but avoids double rounding in the subnormal range. */ - float32x4_t r0 = vfmaq_f32 (scale, poly, scale); - float32x4_t r = vbslq_f32 (cmp1, r1, r0); - return vsubq_f32 (vbslq_f32 (cmp2, r2, r), v_f32 (1.0f)); -} - /* Fast vector implementation of single-precision exp10m1. Algorithm is accurate to 1.70 + 0.5 ULP. _ZGVnN4v_exp10m1f(0x1.36f94cp-3) got 0x1.ac96acp-2 @@ -82,16 +68,17 @@ float32x4_t VPCS_ATTR NOINLINE V_NAME_F1 (exp10m1) (float32x4_t x) const struct data *d = ptr_barrier (&data); /* exp10(x) = 2^n * 10^r = 2^n * (1 + poly (r)), - with poly(r) in [1/sqrt(2), sqrt(2)] and + with 1 + poly(r) in [1/sqrt(2), sqrt(2)] and x = r + n * log10 (2), with r in [-log10(2)/2, log10(2)/2]. */ - float32x4_t log10_2 = vld1q_f32 (&d->log10_2_high); float32x4_t n = vrndaq_f32 (vmulq_f32 (x, d->inv_log10_2)); + float32x4_t log10_2 = vld1q_f32 (&d->log10_2_high); float32x4_t r = vfmsq_laneq_f32 (x, n, log10_2, 0); r = vfmaq_laneq_f32 (r, n, log10_2, 1); - uint32x4_t e = vshlq_n_u32 (vreinterpretq_u32_s32 (vcvtaq_s32_f32 (n)), 23); + uint32x4_t e = vshlq_n_u32 (vreinterpretq_u32_s32 (vcvtaq_s32_f32 (n)), 23); float32x4_t scale = vreinterpretq_f32_u32 (vaddq_u32 (e, d->exponent_bias)); - uint32x4_t cmp = vcagtq_f32 (n, d->special_bound); + + uint32x4_t cmp = vcageq_f32 (x, d->special_bound); /* Pairwise Horner scheme. */ float32x4_t log10lo_c246 = vld1q_f32 (&d->log10_lo); @@ -112,8 +99,12 @@ float32x4_t VPCS_ATTR NOINLINE V_NAME_F1 (exp10m1) (float32x4_t x) /* Fallback to special case for lanes with overflow. */ if (__glibc_unlikely (v_any_u32 (cmp))) - return vbslq_f32 (cmp, special_case (poly, n, e, cmp, scale, d), y); - + { + float32x4_t special + = (expf_special (poly, n, e, cmp, scale, &d->special_data)); + float32x4_t specialm1 = vsubq_f32 (special, v_f32 (1.0f)); + return vbslq_f32 (cmp, specialm1, y); + } return y; } libmvec_hidden_def (V_NAME_F1 (exp10m1)) diff --git a/sysdeps/aarch64/fpu/exp10m1f_sve.c b/sysdeps/aarch64/fpu/exp10m1f_sve.c index 8cc16ffd909..de0b0b8d1d6 100644 --- a/sysdeps/aarch64/fpu/exp10m1f_sve.c +++ b/sysdeps/aarch64/fpu/exp10m1f_sve.c @@ -18,22 +18,22 @@ . */ #include "sv_math.h" +#include "sv_expf_special_inline.h" -/* Value of |x| above which scale overflows without special treatment. */ -#define SpecialBound 126.0f /* rint (log2 (2^127 / (1 + sqrt (2)))). */ - -/* Value of n above which scale overflows even with special treatment. */ -#define ScaleBound 192.0f +/* Value of |x| above which scale overflows without special treatment. + log10(2^(127 + 0.5)) ~ 38.3813244. */ +#define SpecialBound 0x1.330cf3ce9955ap+5 static const struct data { - float log10_2_high, log10_2_low; - float log10_lo, c2, c4, c6, c8; - float32_t log10_hi, c1, c3, c5, c7; - float32_t inv_log10_2, special_bound; - uint32_t exponent_bias, special_offset, special_bias; - float32_t scale_thresh; + struct sv_expf_special_data special_data; + float c2, c4, c6, c8; + float log10_2_high, log10_2_low, inv_log10_2, log10_lo; + float32_t c1, c3, c5, c7; + float32_t log10_hi, special_bound; + uint32_t exponent_bias; } data = { + .special_data = SV_EXPF_SPECIAL_DATA, /* Coefficients generated using Remez algorithm with minimisation of relative error. */ .log10_hi = 0x1.26bb1b8000000p+1, @@ -50,31 +50,9 @@ static const struct data .log10_2_high = 0x1.344136p-2, .log10_2_low = 0x1.ec10cp-27, .exponent_bias = 0x3f800000, - .special_offset = 0x82000000, - .special_bias = 0x7f000000, - .scale_thresh = ScaleBound, .special_bound = SpecialBound, }; -static svfloat32_t NOINLINE -special_case (svfloat32_t poly, svfloat32_t n, svuint32_t e, svbool_t cmp1, - svfloat32_t scale, const struct data *d) -{ - svbool_t b = svcmple (svptrue_b32 (), n, 0.0f); - svfloat32_t s1 = svreinterpret_f32 ( - svsel (b, sv_u32 (d->special_offset + d->special_bias), - sv_u32 (d->special_bias))); - svfloat32_t s2 - = svreinterpret_f32 (svsub_m (b, e, sv_u32 (d->special_offset))); - svbool_t cmp2 = svacgt (svptrue_b32 (), n, d->scale_thresh); - svfloat32_t r2 = svmul_x (svptrue_b32 (), s1, s1); - svfloat32_t r1 - = svmul_x (svptrue_b32 (), svmla_x (svptrue_b32 (), s2, poly, s2), s1); - svfloat32_t r0 = svmla_x (svptrue_b32 (), scale, poly, scale); - svfloat32_t r = svsel (cmp1, r1, r0); - return svsub_x (svptrue_b32 (), svsel (cmp2, r2, r), 1.0f); -} - /* Fast vector implementation of single-precision exp10. Algorithm is accurate to 1.68 + 0.5 ULP. _ZGVnN4v_exp10m1f(0x1.3aeffep-3) got 0x1.b3139p-2 @@ -83,19 +61,19 @@ svfloat32_t SV_NAME_F1 (exp10m1) (svfloat32_t x, const svbool_t pg) { const struct data *d = ptr_barrier (&data); + /* This vector is reliant on layout of data - it contains constants + that can be used with _lane forms of svmla/svmls/svmul. Values are: + [ log10_2_high, log10_2_low, inv_log10_2, log10_lo ]. */ + svfloat32_t log10 = svld1rq (svptrue_b32 (), &d->log10_2_high); + /* exp10(x) = 2^n * 10^r = 2^n * (1 + poly (r)), with poly(r) in [1/sqrt(2), sqrt(2)] and x = r + n * log10 (2), with r in [-log10(2)/2, log10(2)/2]. */ - svfloat32_t log10_2 = svld1rq (svptrue_b32 (), &d->log10_2_high); - svfloat32_t n = svrinta_x (pg, svmul_x (pg, x, d->inv_log10_2)); - svfloat32_t r = svmls_lane_f32 (x, n, log10_2, 0); - r = svmla_lane_f32 (r, n, log10_2, 1); - - svuint32_t e = svlsl_x (pg, svreinterpret_u32 (svcvt_s32_x (pg, n)), 23); + svfloat32_t n = svrinta_x (pg, svmul_lane (x, log10, 2)); + svfloat32_t r = svmls_lane_f32 (x, n, log10, 0); + r = svmla_lane_f32 (r, n, log10, 1); - svfloat32_t scale - = svreinterpret_f32 (svadd_n_u32_x (pg, e, d->exponent_bias)); - svbool_t cmp = svacgt_n_f32 (pg, n, d->special_bound); + svfloat32_t scale = svscale_x (pg, sv_f32 (1.0f), svcvt_s32_x (pg, n)); /* Pairwise Horner scheme. */ svfloat32_t r2 = svmul_x (pg, r, r); @@ -107,16 +85,13 @@ svfloat32_t SV_NAME_F1 (exp10m1) (svfloat32_t x, const svbool_t pg) svfloat32_t p58 = svmla_x (pg, p56, r2, p78); svfloat32_t p36 = svmla_x (pg, p34, r2, p58); svfloat32_t p16 = svmla_x (pg, p12, r2, p36); - - svfloat32_t poly = svmla_n_f32_x (pg, svmul_x (pg, r, sv_f32 (d->log10_hi)), - r, d->log10_lo); + svfloat32_t poly = svmla_lane (svmul_x (pg, r, d->log10_hi), r, log10, 3); poly = svmla_x (pg, poly, p16, r2); - svfloat32_t y = svmla_x (pg, svsub_x (pg, scale, 1.0f), poly, scale); - + svbool_t cmp = svacge_n_f32 (svptrue_b32 (), x, d->special_bound); /* Fallback to special case for lanes with overflow. */ - if (__glibc_unlikely (svptest_any (pg, cmp))) - return svsel_f32 (cmp, special_case (poly, n, e, cmp, scale, d), y); + if (__glibc_unlikely (svptest_any (cmp, cmp))) + return special_case (poly, n, scale, cmp, &d->special_data); - return y; + return svmla_x (pg, svsub_x (pg, scale, 1.0f), poly, scale); } diff --git a/sysdeps/aarch64/fpu/exp2_advsimd.c b/sysdeps/aarch64/fpu/exp2_advsimd.c index eefc66d9508..f6c345b7c97 100644 --- a/sysdeps/aarch64/fpu/exp2_advsimd.c +++ b/sysdeps/aarch64/fpu/exp2_advsimd.c @@ -19,25 +19,38 @@ #include "v_math.h" #include "poly_advsimd_f64.h" +#include "v_exp_special_case_inline.h" + +/* Value of |x| above which scale overflows without special treatment. */ +#define SpecialBound 0x1.ffp+9 /* log2(2^1022) = 1022.00. */ + +/* Value of n above which scale overflows even with special treatment. */ +#define ScaleBound 0x1.4p+10 /* 1280.0. */ #define N (1 << V_EXP_TABLE_BITS) #define IndexMask (N - 1) -#define BigBound 1022.0 -#define UOFlowBound 1280.0 -#define TinyBound 0x2000000000000000 /* asuint64(0x1p-511). */ static const struct data { + struct v_exp_special_data special_data; float64x2_t poly[4]; - float64x2_t shift, scale_big_bound, scale_uoflow_bound; + float64x2_t shift, special_bound, scale_thresh; } data = { + .special_data = V_EXP_SPECIAL_DATA, /* Coefficients are computed using Remez algorithm with minimisation of the absolute error. */ .poly = { V2 (0x1.62e42fefa3686p-1), V2 (0x1.ebfbdff82c241p-3), V2 (0x1.c6b09b16de99ap-5), V2 (0x1.3b2abf5571ad8p-7) }, .shift = V2 (0x1.8p52 / N), - .scale_big_bound = V2 (BigBound), - .scale_uoflow_bound = V2 (UOFlowBound), + /* Implementation triggers special case handling as soon as the scale + overflows, which is earlier than exp2's overflow bound + `log2(DBL_MAX) ≈ 1024.0` or underflow bound + `log2(DBL_TRUE_MIN) = -1074.0`. + The absolute comparison catches all these cases efficiently, although + there is a small window where special cases are triggered + unnecessarily. */ + .special_bound = V2 (SpecialBound), + .scale_thresh = V2 (ScaleBound), }; static inline uint64x2_t @@ -47,35 +60,14 @@ lookup_sbits (uint64x2_t i) __v_exp_data[i[1] & IndexMask] }; } -# define SpecialOffset 0x6000000000000000 /* 0x1p513. */ -/* SpecialBias1 + SpecialBias1 = asuint(1.0). */ -# define SpecialBias1 0x7000000000000000 /* 0x1p769. */ -# define SpecialBias2 0x3010000000000000 /* 0x1p-254. */ - -static inline float64x2_t VPCS_ATTR -special_case (float64x2_t s, float64x2_t y, float64x2_t n, - const struct data *d) -{ - /* 2^(n/N) may overflow, break it up into s1*s2. */ - uint64x2_t b = vandq_u64 (vclezq_f64 (n), v_u64 (SpecialOffset)); - float64x2_t s1 = vreinterpretq_f64_u64 (vsubq_u64 (v_u64 (SpecialBias1), b)); - float64x2_t s2 = vreinterpretq_f64_u64 ( - vaddq_u64 (vsubq_u64 (vreinterpretq_u64_f64 (s), v_u64 (SpecialBias2)), b)); - uint64x2_t cmp = vcagtq_f64 (n, d->scale_uoflow_bound); - float64x2_t r1 = vmulq_f64 (s1, s1); - float64x2_t r0 = vmulq_f64 (vfmaq_f64 (s2, s2, y), s1); - return vbslq_f64 (cmp, r1, r0); -} - /* Fast vector implementation of exp2. Maximum measured error is 1.65 ulp. _ZGVnN2v_exp2(-0x1.4c264ab5b559bp-6) got 0x1.f8db0d4df721fp-1 want 0x1.f8db0d4df721dp-1. */ -VPCS_ATTR -float64x2_t V_NAME_D1 (exp2) (float64x2_t x) +float64x2_t VPCS_ATTR V_NAME_D1 (exp2) (float64x2_t x) { const struct data *d = ptr_barrier (&data); - uint64x2_t cmp = vcagtq_f64 (x, d->scale_big_bound); + uint64x2_t cmp = vcagtq_f64 (x, d->special_bound); /* n = round(x/N). */ float64x2_t z = vaddq_f64 (d->shift, x); @@ -85,17 +77,17 @@ float64x2_t V_NAME_D1 (exp2) (float64x2_t x) /* r = x - n/N. */ float64x2_t r = vsubq_f64 (x, n); - /* s = 2^(n/N). */ + /* scale = 2^(n/N). */ uint64x2_t e = vshlq_n_u64 (u, 52 - V_EXP_TABLE_BITS); u = lookup_sbits (u); - float64x2_t s = vreinterpretq_f64_u64 (vaddq_u64 (u, e)); + float64x2_t scale = vreinterpretq_f64_u64 (vaddq_u64 (u, e)); - /* y ~ exp2(r) - 1. */ + /* poly ~ exp2(r) - 1. */ float64x2_t r2 = vmulq_f64 (r, r); - float64x2_t y = v_pairwise_poly_3_f64 (r, r2, d->poly); - y = vmulq_f64 (r, y); + float64x2_t poly = v_pairwise_poly_3_f64 (r, r2, d->poly); + poly = vmulq_f64 (r, poly); if (__glibc_unlikely (v_any_u64 (cmp))) - return special_case (s, y, n, d); - return vfmaq_f64 (s, s, y); + return exp_special (poly, n, scale, d->scale_thresh, &d->special_data); + return vfmaq_f64 (scale, scale, poly); } diff --git a/sysdeps/aarch64/fpu/exp2_sve.c b/sysdeps/aarch64/fpu/exp2_sve.c index 892939d1e27..a6f002f9b0c 100644 --- a/sysdeps/aarch64/fpu/exp2_sve.c +++ b/sysdeps/aarch64/fpu/exp2_sve.c @@ -18,57 +18,48 @@ . */ #include "sv_math.h" +#include "sv_exp_special_inline.h" -#define BigBound 1022 -#define UOFlowBound 1280 +/* Value of |x| above which scale overflows without special treatment. + log2(2^(1022 + 1/128)) ~ 1022.00. */ +#define SpecialBound 0x1.ff01p+9 + +/* Value of n above which scale overflows even with special treatment. */ +#define ScaleBound 1280 static const struct data { double c2, c4; double c0, c1, c3; - double shift, big_bound, uoflow_bound; + double shift, special_bound; + struct sv_exp_special_data special_data; } data = { + .special_data = SV_EXP_SPECIAL_DATA, /* Coefficients are computed using Remez algorithm with minimisation of the absolute error. */ - .c0 = 0x1.62e42fefa39efp-1, .c1 = 0x1.ebfbdff82a31bp-3, - .c2 = 0x1.c6b08d706c8a5p-5, .c3 = 0x1.3b2ad2ff7d2f3p-7, - .c4 = 0x1.5d8761184beb3p-10, .shift = 0x1.800000000ffc0p+46, - .uoflow_bound = UOFlowBound, .big_bound = BigBound, + .c0 = 0x1.62e42fefa39efp-1, + .c1 = 0x1.ebfbdff82a31bp-3, + .c2 = 0x1.c6b08d706c8a5p-5, + .c3 = 0x1.3b2ad2ff7d2f3p-7, + .c4 = 0x1.5d8761184beb3p-10, + .shift = 0x1.800000000ffc0p+46, + .special_bound = SpecialBound, }; -#define SpecialOffset 0x6000000000000000 /* 0x1p513. */ -/* SpecialBias1 + SpecialBias1 = asuint(1.0). */ -#define SpecialBias1 0x7000000000000000 /* 0x1p769. */ -#define SpecialBias2 0x3010000000000000 /* 0x1p-254. */ - -/* Update of both special and non-special cases, if any special case is - detected. */ -static inline svfloat64_t -special_case (svbool_t pg, svfloat64_t s, svfloat64_t y, svfloat64_t n, - const struct data *d) +static svfloat64_t NOINLINE +special_exp (svfloat64_t poly, svfloat64_t scale, svfloat64_t n, svfloat64_t z, + const struct sv_exp_special_data *ds) { - /* s=2^n may overflow, break it up into s=s1*s2, - such that exp = s + s*y can be computed as s1*(s2+s2*y) - and s1*s1 overflows only if n>0. */ - - /* If n<=0 then set b to 0x6, 0 otherwise. */ - svbool_t p_sign = svcmple (pg, n, 0.0); /* n <= 0. */ - svuint64_t b = svdup_u64_z (p_sign, SpecialOffset); - - /* Set s1 to generate overflow depending on sign of exponent n. */ - svfloat64_t s1 = svreinterpret_f64 (svsubr_x (pg, b, SpecialBias1)); - /* Offset s to avoid overflow in final result if n is below threshold. */ - svfloat64_t s2 = svreinterpret_f64 ( - svadd_x (pg, svsub_x (pg, svreinterpret_u64 (s), SpecialBias2), b)); - - /* |n| > 1280 => 2^(n) overflows. */ - svbool_t p_cmp = svacle (pg, n, d->uoflow_bound); - - svfloat64_t r1 = svmul_x (svptrue_b64 (), s1, s1); - svfloat64_t r2 = svmla_x (pg, s2, s2, y); - svfloat64_t r0 = svmul_x (svptrue_b64 (), r2, s1); - - return svsel (p_cmp, r0, r1); + /* FEXPA zeroes the sign bit, however the sign is meaningful to the + special case function so needs to be copied. + e = sign bit of u << 46. */ + svuint64_t u = svreinterpret_u64 (z); + svuint64_t e = svand_x (svptrue_b64 (), svlsl_x (svptrue_b64 (), u, 46), + 0x8000000000000000); + /* Copy sign to scale. */ + scale = svreinterpret_f64 ( + svadd_x (svptrue_b64 (), e, svreinterpret_u64 (scale))); + return special_case (scale, poly, n, ds); } /* Fast vector implementation of exp2. @@ -78,11 +69,10 @@ special_case (svbool_t pg, svfloat64_t s, svfloat64_t y, svfloat64_t n, svfloat64_t SV_NAME_D1 (exp2) (svfloat64_t x, svbool_t pg) { const struct data *d = ptr_barrier (&data); - svbool_t special = svacge (pg, x, d->big_bound); svfloat64_t z = svadd_x (svptrue_b64 (), x, d->shift); - svfloat64_t n = svsub_x (svptrue_b64 (), z, d->shift); - svfloat64_t r = svsub_x (svptrue_b64 (), x, n); + svfloat64_t n = svsub_x (pg, z, d->shift); + svfloat64_t r = svsub_x (pg, x, n); svfloat64_t scale = svexpa (svreinterpret_u64 (z)); @@ -95,19 +85,13 @@ svfloat64_t SV_NAME_D1 (exp2) (svfloat64_t x, svbool_t pg) svfloat64_t p34 = svmla_lane (sv_f64 (d->c3), r, c24, 1); svfloat64_t p = svmla_x (pg, p12, p34, r2); p = svmad_x (pg, p, r, d->c0); - svfloat64_t y = svmul_x (svptrue_b64 (), r, p); - - /* Assemble exp2(x) = exp2(r) * scale. */ - if (__glibc_unlikely (svptest_any (pg, special))) - { - /* FEXPA zeroes the sign bit, however the sign is meaningful to the - special case function so needs to be copied. - e = sign bit of u << 46. */ - svuint64_t e = svand_x (pg, svlsl_x (pg, svreinterpret_u64 (z), 46), - 0x8000000000000000); - scale = svreinterpret_f64 (svadd_x (pg, e, svreinterpret_u64 (scale))); - return special_case (pg, scale, y, n, d); - } - - return svmla_x (pg, scale, scale, y); + svfloat64_t poly = svmul_x (svptrue_b64 (), r, p); + + svbool_t special = svacge (svptrue_b64 (), x, d->special_bound); + /* Assemble result as exp(x) = 2^n * exp(r). If |x| > Thresh the + multiplication may overflow, so use special case routine. */ + if (__glibc_unlikely (svptest_any (special, special))) + return special_exp (poly, scale, n, z, &d->special_data); + + return svmla_x (pg, scale, scale, poly); } diff --git a/sysdeps/aarch64/fpu/exp2f_advsimd.c b/sysdeps/aarch64/fpu/exp2f_advsimd.c index f2174379030..bda7d8227d0 100644 --- a/sysdeps/aarch64/fpu/exp2f_advsimd.c +++ b/sysdeps/aarch64/fpu/exp2f_advsimd.c @@ -18,44 +18,39 @@ . */ #include "v_math.h" +#include "v_expf_special_inline.h" + +/* Value of |x| above which scale overflows without special treatment. */ +#define SpecialBound 0x1.fap+6 /* = 126.50. */ static const struct data { + struct v_expf_special_data special_data; float32x4_t c1, c3; - uint32x4_t exponent_bias, special_offset, special_bias; - float32x4_t scale_thresh, special_bound; + uint32x4_t exponent_bias; + float32x4_t special_bound; float c0, c2, c4, zero; } data = { - /* maxerr: 1.962 ulp. */ + .special_data = V_EXPF_SPECIAL_DATA, .c0 = 0x1.59977ap-10f, .c1 = V4 (0x1.3ce9e4p-7f), .c2 = 0x1.c6bd32p-5f, .c3 = V4 (0x1.ebf9bcp-3f), .c4 = 0x1.62e422p-1f, .exponent_bias = V4 (0x3f800000), - .special_offset = V4 (0x82000000), - .special_bias = V4 (0x7f000000), - .special_bound = V4 (126.0f), - .scale_thresh = V4 (192.0f), + /* Implementation triggers special case handling as soon as the scale + overflows, which is earlier than exp2f's true overflow bound + `log2(FLT_MAX) ≈ 128.0` or underflow bound `log2(FLT_TRUE_MIN) = -149.0`. + The absolute comparison catches all these cases efficiently, although + there is a small window where special cases are triggered + unnecessarily. */ + .special_bound = V4 (SpecialBound), }; -static float32x4_t VPCS_ATTR NOINLINE -special_case (float32x4_t poly, float32x4_t n, uint32x4_t e, uint32x4_t cmp1, - float32x4_t scale, const struct data *d) -{ - /* 2^n may overflow, break it up into s1*s2. */ - uint32x4_t b = vandq_u32 (vclezq_f32 (n), d->special_offset); - float32x4_t s1 = vreinterpretq_f32_u32 (vaddq_u32 (b, d->special_bias)); - float32x4_t s2 = vreinterpretq_f32_u32 (vsubq_u32 (e, b)); - uint32x4_t cmp2 = vcagtq_f32 (n, d->scale_thresh); - float32x4_t r2 = vmulq_f32 (s1, s1); - float32x4_t r1 = vmulq_f32 (vfmaq_f32 (s2, poly, s2), s1); - /* Similar to r1 but avoids double rounding in the subnormal range. */ - float32x4_t r0 = vfmaq_f32 (scale, poly, scale); - float32x4_t r = vbslq_f32 (cmp1, r1, r0); - return vbslq_f32 (cmp2, r2, r); -} - +/* Single-precision vector exp2f routine. + The maximum error is 1.47 +0.5 ULP: + _ZGVnN4v_exp2f(0x1.7fdccep+0) got 0x1.69e764p+1 + want 0x1.69e768p+1. */ float32x4_t VPCS_ATTR NOINLINE V_NAME_F1 (exp2) (float32x4_t x) { const struct data *d = ptr_barrier (&data); @@ -67,7 +62,7 @@ float32x4_t VPCS_ATTR NOINLINE V_NAME_F1 (exp2) (float32x4_t x) uint32x4_t e = vshlq_n_u32 (vreinterpretq_u32_s32 (vcvtaq_s32_f32 (x)), 23); float32x4_t scale = vreinterpretq_f32_u32 (vaddq_u32 (e, d->exponent_bias)); - uint32x4_t cmp = vcagtq_f32 (n, d->special_bound); + uint32x4_t cmp = vcageq_f32 (x, d->special_bound); float32x4_t c024 = vld1q_f32 (&d->c0); float32x4_t r2 = vmulq_f32 (r, r); @@ -78,7 +73,7 @@ float32x4_t VPCS_ATTR NOINLINE V_NAME_F1 (exp2) (float32x4_t x) float32x4_t poly = vfmaq_f32 (p, q, r2); if (__glibc_unlikely (v_any_u32 (cmp))) - return special_case (poly, n, e, cmp, scale, d); + return expf_special (poly, n, e, cmp, scale, &d->special_data); return vfmaq_f32 (scale, poly, scale); } diff --git a/sysdeps/aarch64/fpu/exp2f_sve.c b/sysdeps/aarch64/fpu/exp2f_sve.c index 85968351728..b71f472d449 100644 --- a/sysdeps/aarch64/fpu/exp2f_sve.c +++ b/sysdeps/aarch64/fpu/exp2f_sve.c @@ -19,11 +19,18 @@ #include "sv_math.h" -#define Thres 0x1.5d5e2ap+6f +/* For x < -SpecialBound, the result is subnormal and not handled + correctly by FEXPA. */ +#define SpecialBound 0x1.f8p6f /* log2(2^126) = 126.0f. */ + +/* Values of x over which exp overflows or underflows. */ +#define InfBound 0x1.0p7f /* 128.0f. */ +#define ZeroBound -0x1.2a8p7f /* -149.0f. */ static const struct data { - float c0, c1, shift, thres; + float c0, c1, shift, special_bound; + float inf_bound, zero_bound; } data = { /* Coefficients generated using Remez algorithm with minimisation of relative error. */ @@ -31,16 +38,16 @@ static const struct data .c1 = 0x1.ebfbe0p-3, /* 1.5*2^17 + 127. */ .shift = 0x1.803f8p17f, - /* Roughly 87.3. For x < -Thres, the result is subnormal and not handled - correctly by FEXPA. */ - .thres = Thres, + .special_bound = SpecialBound, + .inf_bound = InfBound, + .zero_bound = ZeroBound, }; static inline svfloat32_t sv_exp2f_inline (svfloat32_t x, const svbool_t pg, const struct data *d) { /* exp2(x) = 2^n (1 + poly(r)), with 1 + poly(r) in [1/sqrt(2),sqrt(2)] - x = n + r, with r in [-1/2, 1/2]. */ + x = n + r, with r in [-1/2, 1/2]. */ svfloat32_t z = svadd_x (svptrue_b32 (), x, d->shift); svfloat32_t n = svsub_x (svptrue_b32 (), z, d->shift); svfloat32_t r = svsub_x (svptrue_b32 (), x, n); @@ -54,21 +61,35 @@ sv_exp2f_inline (svfloat32_t x, const svbool_t pg, const struct data *d) } static svfloat32_t NOINLINE -special_case (svfloat32_t x, svbool_t special, const struct data *d) +special_case (svfloat32_t x, svbool_t pg, svbool_t special, + const struct data *d) { - return sv_call_f32 (exp2f, x, sv_exp2f_inline (x, svptrue_b32 (), d), - special); + /* This deals with overflow and underflow in exponential for special case + lanes. */ + svbool_t is_inf = svcmpgt (pg, x, d->inf_bound); + svbool_t is_zero = svcmplt (pg, x, d->zero_bound); + + /* The input `x` is further reduced (to `x/2`) to allow for accurate + approximation on the interval `x > SpecialBound = 126.0`. */ + x = svmul_x (special, x, 0.5); + + /* Computes exp(x/2), and set lanes with underflow/overflow. */ + svfloat32_t half_exp = sv_exp2f_inline (x, svptrue_b32 (), d); + half_exp = svmul_m (special, half_exp, half_exp); + half_exp = svsel (is_inf, sv_f32 (INFINITY), half_exp); + + return svsel (is_zero, sv_f32 (0), half_exp); } /* Single-precision SVE exp2f routine, based on the FEXPA instruction. - Worst case error is 1.09 ULPs. - _ZGVsMxv_exp2f (0x1.9a2a94p-1) got 0x1.be1054p+0 - want 0x1.be1052p+0. */ + Worst case error is 2.87 +0.50 ULP. + _ZGVsMxv_exp2f (0x1.fbcb78p+6) got 0x1.ee1d32p+126 + want 0x1.ee1d2cp+126. */ svfloat32_t SV_NAME_F1 (exp2) (svfloat32_t x, const svbool_t pg) { const struct data *d = ptr_barrier (&data); - svbool_t special = svacgt (pg, x, d->thres); + svbool_t special = svacgt (pg, x, d->special_bound); if (__glibc_unlikely (svptest_any (special, special))) - return special_case (x, special, d); + return special_case (x, pg, special, d); return sv_exp2f_inline (x, pg, d); } diff --git a/sysdeps/aarch64/fpu/exp2m1_advsimd.c b/sysdeps/aarch64/fpu/exp2m1_advsimd.c index 684c8125f38..3e3f608da5b 100644 --- a/sysdeps/aarch64/fpu/exp2m1_advsimd.c +++ b/sysdeps/aarch64/fpu/exp2m1_advsimd.c @@ -18,12 +18,13 @@ . */ #include "v_math.h" +#include "v_exp_special_case_inline.h" /* Value of |x| above which scale overflows without special treatment. */ -#define SpecialBound 1022.0 +#define SpecialBound 0x1.ff8p+9 /* log2(2^1023) = 1023.00. */ /* Value of n above which scale overflows even with special treatment. */ -#define ScaleBound 1280.0 +#define ScaleBound 0x1.4p+10 /* 1280.0. */ /* 87/256, value of x under which table lookup is used for 2^x-1. */ #define TableBound 0x1.5bfffffffffffp-2 @@ -34,13 +35,16 @@ static const struct data { - uint64x2_t exponent_bias, special_offset, special_bias, special_bias2, - sm1_tbl_off, sm1_tbl_mask; + struct v_exp_special_data special_data; + double log2_lo, c2, c4, c6; + uint64x2_t sm1_tbl_off, sm1_tbl_mask; float64x2_t scale_thresh, special_bound, shift, rnd2zero; float64x2_t log2_hi, c1, c3, c5; - double log2_lo, c2, c4, c6; uint64_t scalem1[88]; } data = { + /* Special case data from helper. */ + .special_data = V_EXP_SPECIAL_DATA, + /* Coefficients generated using remez's algorithm for exp2m1(x). */ .log2_hi = V2 (0x1.62e42fefa39efp-1), .log2_lo = 0x1.abc9e3b39803f3p-56, @@ -50,16 +54,19 @@ static const struct data .c4 = 0x1.5d1d37eb33b15p-10, .c5 = V2 (0x1.423f35f371d9ap-13), .c6 = 0x1.e7d57ad9a5f93p-5, - .exponent_bias = V2 (0x3ff0000000000000), - .special_offset = V2 (0x6000000000000000), /* 0x1p513. */ - .special_bias = V2 (0x7000000000000000), /* 0x1p769. */ - .special_bias2 = V2 (0x3010000000000000), /* 0x1p-254. */ - .scale_thresh = V2 (ScaleBound), + /* Implementation triggers special case handling as soon as the scale + overflows, which is earlier than exp2m1's overflow bound + `log2(DBL_MAX) ≈ 1024.0` or underflow bound + `log2(DBL_TRUE_MIN) = -1074.0`. + The absolute comparison catches all these cases efficiently, although + there is a small window where special cases are triggered + unnecessarily. */ .special_bound = V2 (SpecialBound), .shift = V2 (0x1.8p52 / N), .rnd2zero = V2 (-0x1p-8), .sm1_tbl_off = V2 (24), .sm1_tbl_mask = V2 (0x3f), + .scale_thresh = V2 (ScaleBound), /* Table containing 2^x - 1, for 2^x values close to 1. The table holds values of 2^(i/128) - 1, computed in @@ -122,22 +129,6 @@ lookup_sm1bits (float64x2_t x, uint64x2_t u, const struct data *d) return vreinterpretq_f64_u64 (sm1); } -static inline VPCS_ATTR float64x2_t -special_case (float64x2_t poly, float64x2_t n, uint64x2_t e, float64x2_t scale, - const struct data *d) -{ - /* 2^n may overflow, break it up into s1*s2. */ - uint64x2_t b = vandq_u64 (vclezq_f64 (n), d->special_offset); - float64x2_t s1 = vreinterpretq_f64_u64 (vsubq_u64 (d->special_bias, b)); - float64x2_t s2 = vreinterpretq_f64_u64 (vaddq_u64 ( - vsubq_u64 (vreinterpretq_u64_f64 (scale), d->special_bias2), b)); - uint64x2_t cmp2 = vcagtq_f64 (n, d->scale_thresh); - float64x2_t r1 = vmulq_f64 (s1, s1); - float64x2_t r2 = vmulq_f64 (vfmaq_f64 (s2, poly, s2), s1); - /* Similar to r1 but avoids double rounding in the subnormal range. */ - return vsubq_f64 (vbslq_f64 (cmp2, r1, r2), v_f64 (1.0f)); -} - /* Double-precision vector exp2(x) - 1 function. The maximum error is 2.55 + 0.5 ULP. _ZGVnN2v_exp2m1 (0x1.1113e87a035ap-8) got 0x1.7b1d06f0a7d36p-9 @@ -188,7 +179,11 @@ VPCS_ATTR float64x2_t V_NAME_D1 (exp2m1) (float64x2_t x) /* Fallback to special case for lanes with overflow. */ if (__glibc_unlikely (v_any_u64 (cmp))) - return vbslq_f64 (cmp, special_case (poly, n, e, scale, d), y); - + { + float64x2_t special + = (exp_special (poly, n, scale, d->scale_thresh, &d->special_data)); + float64x2_t specialm1 = vsubq_f64 (special, v_f64 (1.0f)); + return vbslq_f64 (cmp, specialm1, y); + } return y; } diff --git a/sysdeps/aarch64/fpu/exp2m1_sve.c b/sysdeps/aarch64/fpu/exp2m1_sve.c index 8bacad8874c..ef13c26ecb4 100644 --- a/sysdeps/aarch64/fpu/exp2m1_sve.c +++ b/sysdeps/aarch64/fpu/exp2m1_sve.c @@ -18,28 +18,24 @@ . */ #include "sv_math.h" +#include "sv_exp_special_inline.h" -/* Value of |x| above which scale overflows without special treatment. */ -#define SpecialBound 1022.0 - -/* Value of n above which scale overflows even with special treatment. */ -#define ScaleBound 1280.0 +/* Value of |x| above which scale overflows without special treatment. + log2(2^(1023 + 1/128)) ~ 1023.01. */ +#define SpecialBound 0x1.ff81p+9 /* 87/256, value of x under which table lookup is used for 2^x-1. */ -#define TableBound 0x1.5bfffffffffffp-2 - -/* Number of bits for each value in the table. */ -#define N (1 << V_EXP_TABLE_BITS) -#define IndexMask (N - 1) +#define FexpaBound 0x1.fp-3 /* 31*(log2(2)/128) = 0.2421875. */ static const struct data { - double shift, rnd2zero; + struct sv_exp_special_data special_data; + double shift; double c1, c3, c5; - double c0, c2, c4; - uint64_t sm1_tbl_mask, sm1_tbl_off; - uint64_t scalem1[88]; + double c0, c2, c4, special_bound; + uint64_t scalem1[32]; } data = { + .special_data = SV_EXP_SPECIAL_DATA, /* Generated using fpminimax. */ .c0 = 0x1.62e42fefa39efp-1, .c1 = 0x1.ebfbdff82c58ep-3, @@ -47,102 +43,43 @@ static const struct data .c3 = 0x1.3b2ab6fc45f33p-7, .c4 = 0x1.5d86c0ff8618dp-10, .c5 = 0x1.4301374d5d2f5p-13, - .shift = 0x1.8p52 / N, - .sm1_tbl_mask = 0x3f, - .sm1_tbl_off = 24, - .rnd2zero = -0x1p-8, - /* Table containing 2^x - 1, for 2^x values close to 1. - The table holds values of 2^(i/128) - 1, computed in - arbitrary precision. - The 1st half contains values associated to i=0..43. - The 2nd half contains values associated to i=-44..-1. */ - .scalem1= { - 0x0000000000000000, 0x3f763da9fb33356e, 0x3f864d1f3bc03077, - 0x3f90c57a1b9fe12f, 0x3f966c34c5615d0f, 0x3f9c1aca777db772, - 0x3fa0e8a30eb37901, 0x3fa3c7d958de7069, 0x3fa6ab0d9f3121ec, - 0x3fa992456e48fee8, 0x3fac7d865a7a3440, 0x3faf6cd5ffda635e, - 0x3fb1301d0125b50a, 0x3fb2abdc06c31cc0, 0x3fb429aaea92ddfb, - 0x3fb5a98c8a58e512, 0x3fb72b83c7d517ae, 0x3fb8af9388c8de9c, - 0x3fba35beb6fcb754, 0x3fbbbe084045cd3a, 0x3fbd4873168b9aa8, - 0x3fbed5022fcd91cc, 0x3fc031dc431466b2, 0x3fc0fa4c8beee4b1, - 0x3fc1c3d373ab11c3, 0x3fc28e727d9531fa, 0x3fc35a2b2f13e6e9, - 0x3fc426ff0fab1c05, 0x3fc4f4efa8fef709, 0x3fc5c3fe86d6cc80, - 0x3fc6942d3720185a, 0x3fc7657d49f17ab1, 0x3fc837f0518db8a9, - 0x3fc90b87e266c18a, 0x3fc9e0459320b7fa, 0x3fcab62afc94ff86, - 0x3fcb8d39b9d54e55, 0x3fcc6573682ec32c, 0x3fcd3ed9a72cffb7, - 0x3fce196e189d4724, 0x3fcef5326091a112, 0x3fcfd228256400dd, - 0x3fd0582887dcb8a8, 0x3fd0c7d76542a25b, 0xbfcb23213cc8e86c, - 0xbfca96ecd0deb7c4, 0xbfca09f58086c6c2, 0xbfc97c3a3cd7e119, - 0xbfc8edb9f5703dc0, 0xbfc85e7398737374, 0xbfc7ce6612886a6d, - 0xbfc73d904ed74b33, 0xbfc6abf137076a8e, 0xbfc61987b33d329e, - 0xbfc58652aa180903, 0xbfc4f25100b03219, 0xbfc45d819a94b14b, - 0xbfc3c7e359c9266a, 0xbfc331751ec3a814, 0xbfc29a35c86a9b1a, - 0xbfc20224341286e4, 0xbfc1693f3d7be6da, 0xbfc0cf85bed0f8b7, - 0xbfc034f690a387de, 0xbfbf332113d56b1f, 0xbfbdfaa500017c2d, - 0xbfbcc0768d4175a6, 0xbfbb84935fc8c257, 0xbfba46f918837cb7, - 0xbfb907a55511e032, 0xbfb7c695afc3b424, 0xbfb683c7bf93b074, - 0xbfb53f391822dbc7, 0xbfb3f8e749b3e342, 0xbfb2b0cfe1266bd4, - 0xbfb166f067f25cfe, 0xbfb01b466423250a, 0xbfad9b9eb0a5ed76, - 0xbfaafd11874c009e, 0xbfa85ae0438b37cb, 0xbfa5b505d5b6f268, - 0xbfa30b7d271980f7, 0xbfa05e4119ea5d89, 0xbf9b5a991288ad16, - 0xbf95f134923757f3, 0xbf90804a4c683d8f, 0xbf860f9f985bc9f4, - 0xbf761eea3847077b, - } + .shift = 0x1.800000000ffc0p+46, /* 1.5*2^46+1023. */ + .special_bound = SpecialBound, + /* Table emulating FEXPA - 1, for values of FEXPA close to 1. + The table holds values of 2^(i/64) - 1, computed in arbitrary precision. + The 1st half contains values associated to i=0..+15. + The 2nd half contains values associated to i=0..-15. */ + .scalem1 = { + 0x0000000000000000, 0x3f864d1f3bc03077, 0x3f966c34c5615d0f, + 0x3fa0e8a30eb37901, 0x3fa6ab0d9f3121ec, 0x3fac7d865a7a3440, + 0x3fb1301d0125b50a, 0x3fb429aaea92ddfb, 0x3fb72b83c7d517ae, + 0x3fba35beb6fcb754, 0x3fbd4873168b9aa8, 0x3fc031dc431466b2, + 0x3fc1c3d373ab11c3, 0x3fc35a2b2f13e6e9, 0x3fc4f4efa8fef709, + 0x3fc6942d3720185a, 0x0000000000000000, 0xbfc331751ec3a814, + 0xbfc20224341286e4, 0xbfc0cf85bed0f8b7, 0xbfbf332113d56b1f, + 0xbfbcc0768d4175a6, 0xbfba46f918837cb7, 0xbfb7c695afc3b424, + 0xbfb53f391822dbc7, 0xbfb2b0cfe1266bd4, 0xbfb01b466423250a, + 0xbfaafd11874c009e, 0xbfa5b505d5b6f268, 0xbfa05e4119ea5d89, + 0xbf95f134923757f3, 0xbf860f9f985bc9f4, + }, }; -static inline svuint64_t -lookup_sbits (svbool_t pg, svuint64_t indices) +static svfloat64_t NOINLINE +special_m1 (svbool_t special, svfloat64_t y, svfloat64_t z, svfloat64_t scale, + svfloat64_t poly, svfloat64_t n, + const struct sv_exp_special_data *ds) { - /* Mask indices to valid range. */ - svuint64_t masked = svand_z (pg, indices, svdup_n_u64 (IndexMask)); - return svld1_gather_index (pg, __v_exp_data, masked); -} - -static inline svfloat64_t -lookup_sm1bits (svbool_t pg, svfloat64_t x, svuint64_t u, const struct data *d) -{ - /* Extract sign bit and use as offset into table. */ - svbool_t signBit = svcmplt_f64 (pg, x, sv_f64 (d->rnd2zero)); - svuint64_t base_idx = svand_x (pg, u, d->sm1_tbl_mask); - svuint64_t idx = svadd_m (signBit, base_idx, sv_u64 (d->sm1_tbl_off)); - - /* Fall back to table lookup for 2^x - 1, when x is close to zero to - avoid large errors. */ - svuint64_t sm1 = svld1_gather_index (svptrue_b64 (), d->scalem1, idx); - return svreinterpret_f64 (sm1); -} - -#define SpecialOffset 0x6000000000000000 /* 0x1p513. */ -/* SpecialBias1 + SpecialBias1 = asuint(1.0). */ -#define SpecialBias1 0x7000000000000000 /* 0x1p769. */ -#define SpecialBias2 0x3010000000000000 /* 0x1p-254. */ - -static inline svfloat64_t -special_case (svfloat64_t s, svfloat64_t y, svfloat64_t n, - const struct data *d) -{ - /* s=2^n may overflow, break it up into s=s1*s2, - such that exp = s + s*y can be computed as s1*(s2+s2*y) - and s1*s1 overflows only if n>0. */ - svbool_t p_sign = svcmple (svptrue_b64 (), n, 0.0); /* n <= 0. */ - svuint64_t b = svdup_u64_z (p_sign, SpecialOffset); - - /* Set s1 to generate overflow depending on sign of exponent n. */ - svfloat64_t s1 - = svreinterpret_f64 (svsubr_x (svptrue_b64 (), b, SpecialBias1)); - /* Offset s to avoid overflow in final result if n is below threshold. */ - svfloat64_t s2 = svreinterpret_f64 (svadd_x ( - svptrue_b64 (), - svsub_x (svptrue_b64 (), svreinterpret_u64 (s), SpecialBias2), b)); - - /* |n| > 1280 => 2^(n) overflows. */ - svbool_t p_cmp = svacle (svptrue_b64 (), n, ScaleBound); - - svfloat64_t r1 = svmul_x (svptrue_b64 (), s1, s1); - svfloat64_t r2 = svmla_x (svptrue_b64 (), s2, s2, y); - svfloat64_t r0 = svmul_x (svptrue_b64 (), r2, s1); - - return svsub_x (svptrue_b64 (), svsel (p_cmp, r0, r1), 1.0); + /* FEXPA zeroes the sign bit, however the sign is meaningful to the + special case function so needs to be copied. + e = sign bit of u << 46. */ + svuint64_t u = svreinterpret_u64 (z); + svuint64_t e = svand_x (svptrue_b64 (), svlsl_x (svptrue_b64 (), u, 46), + 0x8000000000000000); + /* Copy sign to scale. */ + scale = svreinterpret_f64 ( + svadd_x (svptrue_b64 (), e, svreinterpret_u64 (scale))); + svfloat64_t special_result = special_case (scale, poly, n, ds); + return svsel (special, svsub_x (svptrue_b64 (), special_result, 1.0), y); } /* Double-precision SVE exp2(x) - 1. @@ -155,7 +92,6 @@ svfloat64_t SV_NAME_D1 (exp2m1) (svfloat64_t x, svbool_t pg) x = n + r, with r in [-1/2N, 1/2N]. n is a floating point number, multiple of 1/N. */ const struct data *d = ptr_barrier (&data); - svbool_t special = svacge (pg, x, SpecialBound); svfloat64_t z = svadd_x (pg, x, d->shift); svuint64_t u = svreinterpret_u64 (z); @@ -165,9 +101,7 @@ svfloat64_t SV_NAME_D1 (exp2m1) (svfloat64_t x, svbool_t pg) svfloat64_t r2 = svmul_x (svptrue_b64 (), r, r); /* Look up table to calculate 2^n. */ - svuint64_t e = svlsl_x (pg, u, 52 - V_EXP_TABLE_BITS); - svuint64_t scale_bits = lookup_sbits (pg, u); - svfloat64_t scale = svreinterpret_f64_u64 (svadd_x (pg, scale_bits, e)); + svfloat64_t scale = svexpa (u); /* Pairwise Horner scheme. */ svfloat64_t c35 = svld1rq (svptrue_b64 (), &d->c3); @@ -180,18 +114,37 @@ svfloat64_t SV_NAME_D1 (exp2m1) (svfloat64_t x, svbool_t pg) svfloat64_t p05 = svmla_x (pg, p01, r2, p25); svfloat64_t poly = svmul_x (pg, p05, r); - /* Use table to gather scalem1 for small values of x. */ - svbool_t is_small = svaclt (pg, x, TableBound); svfloat64_t scalem1 = svsub_x (pg, scale, sv_f64 (1.0)); + /* For small values, use a lookup table for a more accurate scalem1. */ + svbool_t is_small = svaclt (pg, x, FexpaBound); if (svptest_any (pg, is_small)) - scalem1 = svsel_f64 (is_small, lookup_sm1bits (pg, x, u, d), scalem1); + { + /* Use the low 4 bits of the input of FEXPA as index. */ + svuint64_t base_idx = svand_x (pg, u, 0xf); - /* Construct exp2m1 = (scale - 1) + scale * poly. */ - svfloat64_t y = svmla_x (pg, scalem1, scale, poly); + /* We can use the sign of x as a fifth bit to account for the asymmetry + of e^x around 0. */ + svuint64_t signBit + = svlsl_x (pg, svlsr_x (pg, svreinterpret_u64 (x), 63), 4); + svuint64_t idx = svorr_x (pg, base_idx, signBit); - /* Fallback to special case for lanes with overflow. */ - if (__glibc_unlikely (svptest_any (pg, special))) - return svsel_f64 (special, special_case (scale, poly, n, d), y); + /* Lookup values for scale - 1 for small x. */ + svfloat64_t lookup + = svreinterpret_f64 (svld1_gather_index (is_small, d->scalem1, idx)); + + /* Select the appropriate scale - 1 value based on x. */ + scalem1 = svsel (is_small, lookup, scalem1); + } - return y; + /* Fallback to special case for lanes with overflow. */ + svbool_t special = svacgt (svptrue_b64 (), x, d->special_bound); + /* FEXPA returns nan for large inputs so we special case those. */ + if (__glibc_unlikely (svptest_any (special, special))) + { + svfloat64_t y = svmla_x (svptrue_b64 (), scalem1, scale, poly); + return special_m1 (special, y, z, scale, poly, n, &d->special_data); + } + + /* return expm1 = (scale - 1) + (scale * poly). */ + return svmla_x (pg, scalem1, scale, poly); } diff --git a/sysdeps/aarch64/fpu/exp2m1f_advsimd.c b/sysdeps/aarch64/fpu/exp2m1f_advsimd.c index b48d9c84ffa..0d84e140849 100644 --- a/sysdeps/aarch64/fpu/exp2m1f_advsimd.c +++ b/sysdeps/aarch64/fpu/exp2m1f_advsimd.c @@ -18,53 +18,39 @@ . */ #include "v_math.h" +#include "v_expf_special_inline.h" /* Value of |x| above which scale overflows without special treatment. */ -#define SpecialBound 126.0f /* rint (log2 (2^127 / (1 + sqrt (2)))). */ - -/* Value of n above which scale overflows even with special treatment. */ -#define ScaleBound 192.0f +#define SpecialBound 0x1.fep+6 /* ≈ 127.5. */ static const struct data { - uint32x4_t exponent_bias, special_offset, special_bias; - float32x4_t scale_thresh, special_bound; - float32x4_t log2_hi, c1, c3, c5; + struct v_expf_special_data special_data; float log2_lo, c2, c4, c6; + uint32x4_t exponent_bias; + float32x4_t log2_hi, c1, c3, c5; + float32x4_t special_bound; } data = { + .special_data = V_EXPF_SPECIAL_DATA, /* Coefficients generated using remez's algorithm for exp2m1f(x). */ - .log2_hi = V4 (0x1.62e43p-1), .log2_lo = -0x1.05c610p-29, - .c1 = V4 (0x1.ebfbep-3), .c2 = 0x1.c6b06ep-5, - .c3 = V4 (0x1.3b2a5cp-7), .c4 = 0x1.5da59ep-10, - .c5 = V4 (0x1.440dccp-13), .c6 = 0x1.e081d6p-17, .exponent_bias = V4 (0x3f800000), - .special_offset = V4 (0x82000000), - .special_bias = V4 (0x7f000000), - .scale_thresh = V4 (ScaleBound), + .log2_hi = V4 (0x1.62e43p-1), + .c1 = V4 (0x1.ebfbep-3), + .c3 = V4 (0x1.3b2a5cp-7), + .c5 = V4 (0x1.440dccp-13), + /* Implementation triggers special case handling as soon as the scale + overflows, which is earlier than exp2m1f's true overflow bound + `log2(FLT_MAX) ≈ 128.0` or underflow bound `log2(FLT_TRUE_MIN) = -149.0`. + The absolute comparison catches all these cases efficiently, although + there is a small window where special cases are triggered + unnecessarily. */ .special_bound = V4 (SpecialBound), }; -static float32x4_t VPCS_ATTR -special_case (float32x4_t poly, float32x4_t n, uint32x4_t e, uint32x4_t cmp1, - float32x4_t scale, const struct data *d) -{ - /* 2^n may overflow, break it up into s1*s2. */ - uint32x4_t b = vandq_u32 (vclezq_f32 (n), d->special_offset); - float32x4_t s1 = vreinterpretq_f32_u32 (vaddq_u32 (b, d->special_bias)); - float32x4_t s2 = vreinterpretq_f32_u32 (vsubq_u32 (e, b)); - uint32x4_t cmp2 = vcagtq_f32 (n, d->scale_thresh); - float32x4_t r2 = vmulq_f32 (s1, s1); - float32x4_t r1 = vmulq_f32 (vfmaq_f32 (s2, poly, s2), s1); - /* Similar to r1 but avoids double rounding in the subnormal range. */ - float32x4_t r0 = vfmaq_f32 (scale, poly, scale); - float32x4_t r = vbslq_f32 (cmp1, r1, r0); - return vsubq_f32 (vbslq_f32 (cmp2, r2, r), v_f32 (1.0f)); -} - /* Single-precision vector exp2(x) - 1 function. The maximum error is 1.76 + 0.5 ULP. _ZGVnN4v_exp2m1f (0x1.018af8p-1) got 0x1.ab2ebcp-2 @@ -77,18 +63,18 @@ float32x4_t VPCS_ATTR NOINLINE V_NAME_F1 (exp2m1) (float32x4_t x) x = n + r, with r in [-1/2, 1/2]. */ float32x4_t n = vrndaq_f32 (x); float32x4_t r = vsubq_f32 (x, n); - uint32x4_t e = vshlq_n_u32 (vreinterpretq_u32_s32 (vcvtaq_s32_f32 (x)), 23); + + uint32x4_t e = vshlq_n_u32 (vreinterpretq_u32_s32 (vcvtaq_s32_f32 (n)), 23); float32x4_t scale = vreinterpretq_f32_u32 (vaddq_u32 (e, d->exponent_bias)); - uint32x4_t cmp = vcagtq_f32 (n, d->special_bound); + uint32x4_t cmp = vcageq_f32 (x, d->special_bound); + /* Pairwise horner scheme. */ float32x4_t log2lo_c246 = vld1q_f32 (&d->log2_lo); float32x4_t r2 = vmulq_f32 (r, r); - - /* Pairwise horner scheme. */ - float32x4_t p56 = vfmaq_laneq_f32 (d->c5, r, log2lo_c246, 3); - float32x4_t p34 = vfmaq_laneq_f32 (d->c3, r, log2lo_c246, 2); float32x4_t p12 = vfmaq_laneq_f32 (d->c1, r, log2lo_c246, 1); + float32x4_t p34 = vfmaq_laneq_f32 (d->c3, r, log2lo_c246, 2); + float32x4_t p56 = vfmaq_laneq_f32 (d->c5, r, log2lo_c246, 3); float32x4_t p36 = vfmaq_f32 (p34, r2, p56); float32x4_t p16 = vfmaq_f32 (p12, r2, p36); float32x4_t poly @@ -97,9 +83,14 @@ float32x4_t VPCS_ATTR NOINLINE V_NAME_F1 (exp2m1) (float32x4_t x) float32x4_t y = vfmaq_f32 (vsubq_f32 (scale, v_f32 (1.0f)), poly, scale); + /* Fallback to special case for lanes with overflow. */ if (__glibc_unlikely (v_any_u32 (cmp))) - return vbslq_f32 (cmp, special_case (poly, n, e, cmp, scale, d), y); - + { + float32x4_t special + = (expf_special (poly, n, e, cmp, scale, &d->special_data)); + float32x4_t specialm1 = vsubq_f32 (special, v_f32 (1.0f)); + return vbslq_f32 (cmp, specialm1, y); + } return y; } libmvec_hidden_def (V_NAME_F1 (exp2m1)) diff --git a/sysdeps/aarch64/fpu/exp2m1f_sve.c b/sysdeps/aarch64/fpu/exp2m1f_sve.c index f1fff1af43c..d53a805e19c 100644 --- a/sysdeps/aarch64/fpu/exp2m1f_sve.c +++ b/sysdeps/aarch64/fpu/exp2m1f_sve.c @@ -18,20 +18,21 @@ . */ #include "sv_math.h" +#include "sv_expf_special_inline.h" -/* Value of |x| above which scale overflows without special treatment. */ -#define SpecialBound 126.0f /* rint (log2 (2^127 / (1 + sqrt (2)))). */ - -/* Value of n above which scale overflows even with special treatment. */ -#define ScaleBound 192.0f +/* Value of |x| above which scale overflows without special treatment. + log2(2^(127 + 0.5)) = 127.5. */ +#define SpecialBound 0x1.fep+6 static const struct data { - uint32_t exponent_bias, special_offset, special_bias; - float32_t scale_thresh, special_bound; + struct sv_expf_special_data special_data; + float32_t special_bound; float log2_lo, c2, c4, c6; float log2_hi, c1, c3, c5, shift; + uint32_t exponent_bias; } data = { + .special_data = SV_EXPF_SPECIAL_DATA, /* Coefficients generated using remez's algorithm for exp2m1f(x). */ .log2_hi = 0x1.62e43p-1, .log2_lo = -0x1.05c610p-29, @@ -42,31 +43,9 @@ static const struct data .c5 = 0x1.440dccp-13, .c6 = 0x1.e081d6p-17, .exponent_bias = 0x3f800000, - .special_offset = 0x82000000, - .special_bias = 0x7f000000, - .scale_thresh = ScaleBound, .special_bound = SpecialBound, }; -static svfloat32_t NOINLINE -special_case (svfloat32_t poly, svfloat32_t n, svuint32_t e, svbool_t cmp1, - svfloat32_t scale, const struct data *d) -{ - svbool_t b = svcmple (svptrue_b32 (), n, 0.0f); - svfloat32_t s1 = svreinterpret_f32 ( - svsel (b, sv_u32 (d->special_offset + d->special_bias), - sv_u32 (d->special_bias))); - svfloat32_t s2 - = svreinterpret_f32 (svsub_m (b, e, sv_u32 (d->special_offset))); - svbool_t cmp2 = svacgt (svptrue_b32 (), n, d->scale_thresh); - svfloat32_t r2 = svmul_x (svptrue_b32 (), s1, s1); - svfloat32_t r1 - = svmul_x (svptrue_b32 (), svmla_x (svptrue_b32 (), s2, poly, s2), s1); - svfloat32_t r0 = svmla_x (svptrue_b32 (), scale, poly, scale); - svfloat32_t r = svsel (cmp1, r1, r0); - return svsub_x (svptrue_b32 (), svsel (cmp2, r2, r), 1.0f); -} - /* Single-precision vector exp2(x) - 1 function. The maximum error is 1.76 + 0.5 ULP. _ZGVsMxv_exp2m1f (0x1.018af8p-1) got 0x1.ab2ebcp-2 @@ -78,11 +57,7 @@ svfloat32_t SV_NAME_F1 (exp2m1) (svfloat32_t x, const svbool_t pg) svfloat32_t n = svrinta_x (pg, x); svfloat32_t r = svsub_x (pg, x, n); - svuint32_t e = svlsl_x (pg, svreinterpret_u32 (svcvt_s32_x (pg, n)), 23); - svfloat32_t scale - = svreinterpret_f32 (svadd_n_u32_x (pg, e, d->exponent_bias)); - - svbool_t cmp = svacgt_n_f32 (pg, n, d->special_bound); + svfloat32_t scale = svscale_x (pg, sv_f32 (1.0f), svcvt_s32_x (pg, n)); svfloat32_t r2 = svmul_x (pg, r, r); @@ -98,11 +73,10 @@ svfloat32_t SV_NAME_F1 (exp2m1) (svfloat32_t x, const svbool_t pg) svmul_x (svptrue_b32 (), r, sv_f32 (d->log2_hi)), r, log2lo_c246, 0); poly = svmla_x (pg, poly, p16, r2); - svfloat32_t y = svmla_x (pg, svsub_x (pg, scale, 1.0f), poly, scale); - + svbool_t cmp = svacge_n_f32 (svptrue_b32 (), x, d->special_bound); /* Fallback to special case for lanes with overflow. */ - if (__glibc_unlikely (svptest_any (pg, cmp))) - return svsel_f32 (cmp, special_case (poly, n, e, cmp, scale, d), y); + if (__glibc_unlikely (svptest_any (cmp, cmp))) + return special_case (poly, n, scale, cmp, &d->special_data); - return y; + return svmla_x (pg, svsub_x (pg, scale, 1.0f), poly, scale); } diff --git a/sysdeps/aarch64/fpu/exp_advsimd.c b/sysdeps/aarch64/fpu/exp_advsimd.c index 6a64de10fc1..30fdeac37ea 100644 --- a/sysdeps/aarch64/fpu/exp_advsimd.c +++ b/sysdeps/aarch64/fpu/exp_advsimd.c @@ -18,80 +18,90 @@ . */ #include "v_math.h" +#include "v_exp_special_case_inline.h" -#define N (1 << V_EXP_TABLE_BITS) -#define IndexMask (N - 1) +/* Value of |x| above which scale overflows without special treatment. */ +#define SpecialBound 0x1.6232bdd76683cp+9 /* ln(2^1022) ~ 708.40. */ -const static volatile struct +/* Value of n above which scale overflows even with special treatment. */ +#define ScaleBound 163840.0 /* 1280.0 * N. */ + +static const struct data { - float64x2_t poly[3]; - float64x2_t inv_ln2, ln2_hi, ln2_lo, shift; - float64x2_t special_bound, scale_thresh; + struct v_exp_special_data special_data; + float64x2_t inv_ln2, shift; + float64x2_t special_bound, scale_thresh, c0; + double ln2_hi, ln2_lo; + double c1, c2; } data = { + .special_data = V_EXP_SPECIAL_DATA, /* maxerr: 1.88 +0.5 ulp rel error: 1.4337*2^-53 abs error: 1.4299*2^-53 in [ -ln2/256, ln2/256 ]. */ - .poly = { V2 (0x1.ffffffffffd43p-2), V2 (0x1.55555c75adbb2p-3), - V2 (0x1.55555da646206p-5) }, - .scale_thresh = V2 (163840.0), /* 1280.0 * N. */ - .special_bound = V2 (704.0), + .c0 = V2 (0x1.ffffffffffd43p-2), + .c1 = 0x1.55555c75adbb2p-3, + .c2 = 0x1.55555da646206p-5, + .scale_thresh = V2 (ScaleBound), + /* Implementation triggers special case handling as soon as the scale + overflows, which is earlier than exp's overflow bound + `ln(DBL_MAX) ≈ 709.78` or underflow bound + `ln(DBL_TRUE_MIN) ≈ -744.44`. + The absolute comparison catches all these cases efficiently, although + there is a small window where special cases are triggered + unnecessarily. */ + .special_bound = V2 (SpecialBound), .inv_ln2 = V2 (0x1.71547652b82fep7), /* N/ln2. */ - .ln2_hi = V2 (0x1.62e42fefa39efp-8), /* ln2/N. */ - .ln2_lo = V2 (0x1.abc9e3b39803f3p-63), - .shift = V2 (0x1.8p+52) + .ln2_hi = 0x1.62e42fefa39efp-8, /* ln2/N. */ + .ln2_lo = 0x1.abc9e3b39803f3p-63, + .shift = V2 (0x1.8p+52), }; -#define C(i) data.poly[i] -#define Tab __v_exp_data -# define SpecialOffset v_u64 (0x6000000000000000) /* 0x1p513. */ -/* SpecialBias1 + SpecialBias1 = asuint(1.0). */ -# define SpecialBias1 v_u64 (0x7000000000000000) /* 0x1p769. */ -# define SpecialBias2 v_u64 (0x3010000000000000) /* 0x1p-254. */ +#define N (1 << V_EXP_TABLE_BITS) +#define IndexMask (N - 1) -static inline float64x2_t VPCS_ATTR -special_case (float64x2_t s, float64x2_t y, float64x2_t n) +static inline uint64x2_t +lookup_sbits (uint64x2_t i) { - /* 2^(n/N) may overflow, break it up into s1*s2. */ - uint64x2_t b = vandq_u64 (vcltzq_f64 (n), SpecialOffset); - float64x2_t s1 = vreinterpretq_f64_u64 (vsubq_u64 (SpecialBias1, b)); - float64x2_t s2 = vreinterpretq_f64_u64 ( - vaddq_u64 (vsubq_u64 (vreinterpretq_u64_f64 (s), SpecialBias2), b)); - uint64x2_t cmp = vcagtq_f64 (n, data.scale_thresh); - float64x2_t r1 = vmulq_f64 (s1, s1); - float64x2_t r0 = vmulq_f64 (vfmaq_f64 (s2, y, s2), s1); - return vbslq_f64 (cmp, r1, r0); + return (uint64x2_t){ __v_exp_data[i[0] & IndexMask], + __v_exp_data[i[1] & IndexMask] }; } +/* Fast vector implementation of exp. + Maximum measured error is 1.9 + 0.5ulp. + _ZGVnN2v_expm1(0x1.63b3d5efc2e51p-2) got 0x1.a94d602bf433ep-2 + want 0x1.a94d602bf433cp-2. */ float64x2_t VPCS_ATTR V_NAME_D1 (exp) (float64x2_t x) { - float64x2_t n, r, r2, s, y, z; - uint64x2_t cmp, u, e; - - cmp = vcagtq_f64 (x, data.special_bound); + const struct data *d = ptr_barrier (&data); /* n = round(x/(ln2/N)). */ - z = vfmaq_f64 (data.shift, x, data.inv_ln2); - u = vreinterpretq_u64_f64 (z); - n = vsubq_f64 (z, data.shift); + float64x2_t z = vfmaq_f64 (d->shift, x, d->inv_ln2); + uint64x2_t u = vreinterpretq_u64_f64 (z); + float64x2_t n = vsubq_f64 (z, d->shift); /* r = x - n*ln2/N. */ - r = vfmsq_f64 (x, data.ln2_hi, n); - r = vfmsq_f64 (r, data.ln2_lo, n); + float64x2_t ln2_hi_lo = vld1q_f64 (&d->ln2_hi); + float64x2_t r = x; + r = vfmsq_laneq_f64 (r, n, ln2_hi_lo, 0); + r = vfmsq_laneq_f64 (r, n, ln2_hi_lo, 1); + + uint64x2_t e = vshlq_n_u64 (u, 52 - V_EXP_TABLE_BITS); - e = vshlq_n_u64 (u, 52 - V_EXP_TABLE_BITS); + /* poly = exp(r) - 1 ~= r + C0 r^2 + C1 r^3 + C2 r^4. */ + float64x2_t c12 = vld1q_f64 (&d->c1); + float64x2_t r2 = vmulq_f64 (r, r); + float64x2_t poly = vfmaq_laneq_f64 (d->c0, r, c12, 0); + poly = vfmaq_laneq_f64 (poly, r2, c12, 1); + poly = vfmaq_f64 (r, poly, r2); - /* y = exp(r) - 1 ~= r + C0 r^2 + C1 r^3 + C2 r^4. */ - r2 = vmulq_f64 (r, r); - y = vfmaq_f64 (C (0), C (1), r); - y = vfmaq_f64 (y, C (2), r2); - y = vfmaq_f64 (r, y, r2); + /* scale = 2^(n/N). */ + u = lookup_sbits (u); + float64x2_t scale = vreinterpretq_f64_u64 (vaddq_u64 (u, e)); - /* s = 2^(n/N). */ - u = (uint64x2_t){ Tab[u[0] & IndexMask], Tab[u[1] & IndexMask] }; - s = vreinterpretq_f64_u64 (vaddq_u64 (u, e)); + uint64x2_t cmp = vcagtq_f64 (x, d->special_bound); if (__glibc_unlikely (v_any_u64 (cmp))) - return special_case (s, y, n); + return exp_special (poly, n, scale, d->scale_thresh, &d->special_data); - return vfmaq_f64 (s, y, s); + return vfmaq_f64 (scale, poly, scale); } diff --git a/sysdeps/aarch64/fpu/exp_sve.c b/sysdeps/aarch64/fpu/exp_sve.c index b9d42453183..7c8c5e0559e 100644 --- a/sysdeps/aarch64/fpu/exp_sve.c +++ b/sysdeps/aarch64/fpu/exp_sve.c @@ -18,14 +18,21 @@ . */ #include "sv_math.h" +#include "sv_exp_special_inline.h" + +/* Value of |x| above which scale overflows without special treatment. + ln(2^(1022 + 1/128)) ~ 708.40. */ +#define SpecialBound 0x1.62336f49c49a2p+9 static const struct data { + struct sv_exp_special_data special_data; double c0, c2; double c1, c3; - double ln2_hi, ln2_lo, inv_ln2, shift, thres; - + double ln2_hi, ln2_lo; + double inv_ln2, shift, special_bound; } data = { + .special_data = SV_EXP_SPECIAL_DATA, .c0 = 0x1.fffffffffdbcdp-2, .c1 = 0x1.555555555444cp-3, .c2 = 0x1.555573c6a9f7dp-5, @@ -36,44 +43,22 @@ static const struct data .inv_ln2 = 0x1.71547652b82fep+0, /* 1.5*2^46+1023. This value is further explained below. */ .shift = 0x1.800000000ffc0p+46, - .thres = 704.0, + .special_bound = SpecialBound, }; -#define SpecialOffset 0x6000000000000000 /* 0x1p513. */ -/* SpecialBias1 + SpecialBias1 = asuint(1.0). */ -#define SpecialBias1 0x7000000000000000 /* 0x1p769. */ -#define SpecialBias2 0x3010000000000000 /* 0x1p-254. */ - -/* Update of both special and non-special cases, if any special case is - detected. */ -static inline svfloat64_t -special_case (svbool_t pg, svfloat64_t s, svfloat64_t y, svfloat64_t n) +static svfloat64_t NOINLINE +special_exp (svfloat64_t poly, svfloat64_t scale, svfloat64_t n, svuint64_t u, + const struct sv_exp_special_data *ds) { - /* s=2^n may overflow, break it up into s=s1*s2, - such that exp = s + s*y can be computed as s1*(s2+s2*y) - and s1*s1 overflows only if n>0. */ - - /* If n<=0 then set b to 0x6, 0 otherwise. */ - svbool_t p_sign = svcmple (pg, n, 0.0); /* n <= 0. */ - svuint64_t b - = svdup_u64_z (p_sign, SpecialOffset); /* Inactive lanes set to 0. */ - - /* Set s1 to generate overflow depending on sign of exponent n, - ie. s1 = 0x70...0 - b. */ - svfloat64_t s1 = svreinterpret_f64 (svsubr_x (pg, b, SpecialBias1)); - /* Offset s to avoid overflow in final result if n is below threshold. - ie. s2 = as_u64 (s) - 0x3010...0 + b. */ - svfloat64_t s2 = svreinterpret_f64 ( - svadd_x (pg, svsub_x (pg, svreinterpret_u64 (s), SpecialBias2), b)); - - /* |n| > 1280 => 2^(n) overflows. */ - svbool_t p_cmp = svacgt (pg, n, 1280.0); - - svfloat64_t r1 = svmul_x (svptrue_b64 (), s1, s1); - svfloat64_t r2 = svmla_x (pg, s2, s2, y); - svfloat64_t r0 = svmul_x (svptrue_b64 (), r2, s1); - - return svsel (p_cmp, r1, r0); + /* FEXPA zeroes the sign bit, however the sign is meaningful to the + special case function so needs to be copied. + e = sign bit of u << 46. */ + svuint64_t e = svand_x (svptrue_b64 (), svlsl_x (svptrue_b64 (), u, 46), + 0x8000000000000000); + /* Copy sign to scale. */ + scale = svreinterpret_f64 ( + svadd_x (svptrue_b64 (), e, svreinterpret_u64 (scale))); + return special_case (scale, poly, n, ds); } /* SVE exp algorithm. Maximum measured error is 1.01ulps: @@ -83,8 +68,6 @@ svfloat64_t SV_NAME_D1 (exp) (svfloat64_t x, const svbool_t pg) { const struct data *d = ptr_barrier (&data); - svbool_t special = svacgt (pg, x, d->thres); - /* Use a modifed version of the shift used for flooring, such that x/ln2 is rounded to a multiple of 2^-6=1/64, shift = 1.5 * 2^52 * 2^-6 = 1.5 * 2^46. @@ -107,36 +90,29 @@ svfloat64_t SV_NAME_D1 (exp) (svfloat64_t x, const svbool_t pg) svfloat64_t n = svsub_x (pg, z, d->shift); svfloat64_t c13 = svld1rq (svptrue_b64 (), &d->c1); /* r = x - n * ln2, r is in [-ln2/(2N), ln2/(2N)]. */ - svfloat64_t ln2 = svld1rq (svptrue_b64 (), &d->ln2_hi); - svfloat64_t r = svmls_lane (x, n, ln2, 0); - r = svmls_lane (r, n, ln2, 1); + svfloat64_t ln2_hi_lo = svld1rq (svptrue_b64 (), &d->ln2_hi); + svfloat64_t r = x; + r = svmls_lane (r, n, ln2_hi_lo, 0); + r = svmls_lane (r, n, ln2_hi_lo, 1); - /* y = exp(r) - 1 ~= r + C0 r^2 + C1 r^3 + C2 r^4 + C3 r^5. */ + /* poly = exp(r) - 1 ~= r + C0 r^2 + C1 r^3 + C2 r^4 + C3 r^5. */ svfloat64_t r2 = svmul_x (svptrue_b64 (), r, r); svfloat64_t p01 = svmla_lane (sv_f64 (d->c0), r, c13, 0); svfloat64_t p23 = svmla_lane (sv_f64 (d->c2), r, c13, 1); svfloat64_t p04 = svmla_x (pg, p01, p23, r2); - svfloat64_t y = svmla_x (pg, r, p04, r2); + svfloat64_t poly = svmla_x (pg, r, p04, r2); - /* s = 2^n, computed using FEXPA. FEXPA does not propagate NaNs, so for + /* scale = 2^n, computed using FEXPA. FEXPA does not propagate NaNs, so for consistent NaN handling we have to manually propagate them. This comes at significant performance cost. */ - svfloat64_t s = svexpa (u); + svfloat64_t scale = svexpa (u); + svbool_t special = svacge (svptrue_b64 (), x, d->special_bound); /* Assemble result as exp(x) = 2^n * exp(r). If |x| > Thresh the multiplication may overflow, so use special case routine. */ - - if (__glibc_unlikely (svptest_any (pg, special))) - { - /* FEXPA zeroes the sign bit, however the sign is meaningful to the - special case function so needs to be copied. - e = sign bit of u << 46. */ - svuint64_t e = svand_x (pg, svlsl_x (pg, u, 46), 0x8000000000000000); - /* Copy sign to s. */ - s = svreinterpret_f64 (svadd_x (pg, e, svreinterpret_u64 (s))); - return special_case (pg, s, y, n); - } + if (__glibc_unlikely (svptest_any (special, special))) + return special_exp (poly, scale, n, u, &d->special_data); /* No special case. */ - return svmla_x (pg, s, s, y); + return svmla_x (pg, scale, scale, poly); } diff --git a/sysdeps/aarch64/fpu/expf_advsimd.c b/sysdeps/aarch64/fpu/expf_advsimd.c index 447e93db6f2..07b04c4e7ab 100644 --- a/sysdeps/aarch64/fpu/expf_advsimd.c +++ b/sysdeps/aarch64/fpu/expf_advsimd.c @@ -18,15 +18,20 @@ . */ #include "v_math.h" +#include "v_expf_special_inline.h" + +/* Value of |x| above which scale overflows without special treatment. */ +#define SpecialBound 0x1.5ebb83cf2cf96p+6 /* ≈ 87.69. */ static const struct data { + struct v_expf_special_data special_data; float32x4_t c1, c3, c4, inv_ln2; float ln2_hi, ln2_lo, c0, c2; - uint32x4_t exponent_bias, special_offset, special_bias; - float32x4_t special_bound, scale_thresh; + uint32x4_t exponent_bias; + float32x4_t special_bound; } data = { - /* maxerr: 1.45358 +0.5 ulp. */ + .special_data = V_EXPF_SPECIAL_DATA, .c0 = 0x1.0e4020p-7f, .c1 = V4 (0x1.573e2ep-5f), .c2 = 0x1.555e66p-3f, @@ -36,32 +41,19 @@ static const struct data .ln2_hi = 0x1.62e4p-1f, .ln2_lo = 0x1.7f7d1cp-20f, .exponent_bias = V4 (0x3f800000), - .special_offset = V4 (0x82000000), - .special_bias = V4 (0x7f000000), - .special_bound = V4 (126.0f), - .scale_thresh = V4 (192.0f), + /* Implementation triggers special case handling as soon as the scale + overflows, which is earlier than expf's overflow bound + `log1p(FLT_MAX)~88.7` or underflow bound `log1p(FLT_MIN)~-103.28`. + The absolute comparison catches all these cases efficiently, although + there is a small window where special cases are triggered + unnecessarily. */ + .special_bound = V4 (SpecialBound), }; -#define C(i) d->poly[i] - -static float32x4_t VPCS_ATTR NOINLINE -special_case (float32x4_t poly, float32x4_t n, uint32x4_t e, uint32x4_t cmp1, - float32x4_t scale, const struct data *d) -{ - /* 2^n may overflow, break it up into s1*s2. */ - uint32x4_t b = vandq_u32 (vclezq_f32 (n), d->special_offset); - float32x4_t s1 = vreinterpretq_f32_u32 (vaddq_u32 (b, d->special_bias)); - float32x4_t s2 = vreinterpretq_f32_u32 (vsubq_u32 (e, b)); - uint32x4_t cmp2 = vcagtq_f32 (n, d->scale_thresh); - float32x4_t r2 = vmulq_f32 (s1, s1); - // (s2 + p*s2)*s1 = s2(p+1)s1 - float32x4_t r1 = vmulq_f32 (vfmaq_f32 (s2, poly, s2), s1); - /* Similar to r1 but avoids double rounding in the subnormal range. */ - float32x4_t r0 = vfmaq_f32 (scale, poly, scale); - float32x4_t r = vbslq_f32 (cmp1, r1, r0); - return vbslq_f32 (cmp2, r2, r); -} - +/* Single-precision vector expf routine. + The maximum error is 1.44 +0.5 ULP: + _ZGVnN4v_expf(-0x1.86f03cp+5) got 0x1.69e27p-71 + want 0x1.69e274p-71. */ float32x4_t VPCS_ATTR NOINLINE V_NAME_F1 (exp) (float32x4_t x) { const struct data *d = ptr_barrier (&data); @@ -75,7 +67,7 @@ float32x4_t VPCS_ATTR NOINLINE V_NAME_F1 (exp) (float32x4_t x) uint32x4_t e = vshlq_n_u32 (vreinterpretq_u32_s32 (vcvtq_s32_f32 (n)), 23); float32x4_t scale = vreinterpretq_f32_u32 (vaddq_u32 (e, d->exponent_bias)); - uint32x4_t cmp = vcagtq_f32 (n, d->special_bound); + uint32x4_t cmp = vcageq_f32 (x, d->special_bound); float32x4_t r2 = vmulq_f32 (r, r); float32x4_t p = vfmaq_laneq_f32 (d->c1, r, ln2_c02, 2); @@ -85,7 +77,7 @@ float32x4_t VPCS_ATTR NOINLINE V_NAME_F1 (exp) (float32x4_t x) float32x4_t poly = vfmaq_f32 (p, q, r2); if (__glibc_unlikely (v_any_u32 (cmp))) - return special_case (poly, n, e, cmp, scale, d); + return expf_special (poly, n, e, cmp, scale, &d->special_data); return vfmaq_f32 (scale, poly, scale); } diff --git a/sysdeps/aarch64/fpu/expf_sve.c b/sysdeps/aarch64/fpu/expf_sve.c index 9a59edbe37d..892b3864b02 100644 --- a/sysdeps/aarch64/fpu/expf_sve.c +++ b/sysdeps/aarch64/fpu/expf_sve.c @@ -18,36 +18,120 @@ . */ #include "sv_math.h" -#include "sv_expf_inline.h" -/* Roughly 87.3. For x < -Thres, the result is subnormal and not handled +/* For x < -SpecialBound, the result is subnormal and not handled correctly by FEXPA. */ -#define Thres 0x1.5d5e2ap+6f +#define SpecialBound 0x1.5d5e2ap+6f /* ln(2^126) ~ 87.34. */ + +/* Values of x which exp overflows or underflows. */ +#define InfBound 0x1.62e42fp6f /* ~ 88.72. */ +#define ZeroBound -0x1.9fe368p6f /* ~ 103.97. */ static const struct data { - struct sv_expf_data d; - float thres; + float ln2_hi, ln2_lo, c1, c3; + float shift, inv_ln2, c2; + float32_t special_bound, inf_bound, zero_bound; } data = { - .d = SV_EXPF_DATA, - .thres = Thres, + /* shift = 1.5*2^17 + 127. */ + .shift = 0x1.803f8p17f, + .inv_ln2 = 0x1.715476p+0f, + .ln2_hi = 0x1.62e4p-1f, + .ln2_lo = 0x1.7f7d1cp-20f, + /* Coefficients generated using Remez algorithm with minimisation of relative + error. */ + .c1 = 0.5f, + .c2 = 0x1.55559p-3, + .c3 = 0x1.55555cp-5, + .special_bound = SpecialBound, + .inf_bound = InfBound, + .zero_bound = ZeroBound, }; +static inline svfloat32_t +expf_inline (svfloat32_t x, const svbool_t pg, const struct data *d) +{ + /* exp(x) = 2^n (1 + poly(r)), with 1 + poly(r) in [1/sqrt(2),sqrt(2)] + x = ln2*n + r, with r in [-ln2/2, ln2/2]. */ + svfloat32_t lane_constants = svld1rq (svptrue_b32 (), &d->ln2_hi); + + /* n = round(x/(ln2/N)). */ + svfloat32_t z = svmad_x (pg, sv_f32 (d->inv_ln2), x, d->shift); + svfloat32_t n = svsub_x (pg, z, d->shift); + + /* r = x - n*ln2/N. */ + svfloat32_t r = x; + r = svmls_lane (r, n, lane_constants, 0); + r = svmls_lane (r, n, lane_constants, 1); + + /* scale = 2^(n/N). */ + svfloat32_t scale = svexpa (svreinterpret_u32 (z)); + + /* poly(r) = exp(r) - 1 ~= r + 0.5 r^2. */ + svfloat32_t r2 = svmul_x (svptrue_b32 (), r, r); + svfloat32_t poly = svmla_lane (r, r2, lane_constants, 2); + + return svmla_x (pg, scale, scale, poly); +} + +/* This is the version used if there is any special lane in the input vector. + The approximation needs to match that of the fast path. + To achieve this we assemble the same polynomial, ie `r + 0.5 * r^2`, + then we conditionally add an extra `c2 * r^3` term. */ +static inline svfloat32_t +expf_slow_inline (svfloat32_t x, const svbool_t special, const struct data *d) +{ + svfloat32_t lane_constants = svld1rq (svptrue_b32 (), &d->ln2_hi); + + svfloat32_t z = svmad_x (svptrue_b32 (), sv_f32 (d->inv_ln2), x, d->shift); + svfloat32_t n = svsub_x (svptrue_b32 (), z, d->shift); + + svfloat32_t r = x; + r = svmls_lane (r, n, lane_constants, 0); + r = svmls_lane (r, n, lane_constants, 1); + + /* poly(r) = exp(r) - 1 ~= r + 0.5 r^2 + c2 r^3. */ + svfloat32_t r2 = svmul_x (svptrue_b32 (), r, r); + svfloat32_t r3 = svmul_x (svptrue_b32 (), r, r2); + svfloat32_t poly_slow = svmla_lane (sv_f32 (d->c2), r, lane_constants, 3); + svfloat32_t poly_fast = svmla_lane (r, r2, lane_constants, 2); + svfloat32_t poly = svmla_x (special, poly_fast, poly_slow, r3); + + svfloat32_t scale = svexpa (svreinterpret_u32 (z)); + + return svmla_x (svptrue_b32 (), scale, scale, poly); +} + static svfloat32_t NOINLINE -special_case (svfloat32_t x, svbool_t special, const struct sv_expf_data *d) +special_case (svfloat32_t x, svbool_t pg, svbool_t special, + const struct data *d) { - return sv_call_f32 (expf, x, expf_inline (x, svptrue_b32 (), d), special); + /* This deals with overflow and underflow in exponential for special case + lanes. */ + svbool_t is_inf = svcmpgt (pg, x, d->inf_bound); + svbool_t is_zero = svcmplt (pg, x, d->zero_bound); + + /* The input `x` is further reduced (to `x/2`) to allow for accurate + approximation on the interval `x > SpecialBound ~ 87.3`. */ + x = svmul_m (special, x, 0.5); + + /* Computes exp(x/2), and set lanes with underflow/overflow. */ + svfloat32_t half_exp = expf_slow_inline (x, special, d); + half_exp = svmul_m (special, half_exp, half_exp); + half_exp = svsel (is_inf, sv_f32 (INFINITY), half_exp); + + return svsel (is_zero, sv_f32 (0), half_exp); } /* Optimised single-precision SVE exp function. - Worst-case error is 0.88 +0.50 ULP: - _ZGVsMxv_expf(-0x1.bba276p-6) got 0x1.f25288p-1 - want 0x1.f2528ap-1. */ + Worst-case error is 2.70 +0.50 ULP: + _ZGVsMxv_expf(0x1.5fec38p+6) got 0x1.e7831ep+126 + want 0x1.e78318p+126. */ svfloat32_t SV_NAME_F1 (exp) (svfloat32_t x, const svbool_t pg) { const struct data *d = ptr_barrier (&data); - svbool_t is_special_case = svacgt (pg, x, d->thres); - if (__glibc_unlikely (svptest_any (pg, is_special_case))) - return special_case (x, is_special_case, &d->d); - return expf_inline (x, pg, &d->d); + svbool_t special = svacgt (pg, x, d->special_bound); + if (__glibc_unlikely (svptest_any (special, special))) + return special_case (x, pg, special, d); + return expf_inline (x, svptrue_b32 (), d); } diff --git a/sysdeps/aarch64/fpu/expm1_advsimd.c b/sysdeps/aarch64/fpu/expm1_advsimd.c index 61c1c755479..eb32851b1b3 100644 --- a/sysdeps/aarch64/fpu/expm1_advsimd.c +++ b/sysdeps/aarch64/fpu/expm1_advsimd.c @@ -18,39 +18,112 @@ . */ #include "v_math.h" -#include "v_expm1_inline.h" +#include "v_exp_special_case_inline.h" + +/* Value of |x| above which scale overflows without special treatment. */ +#define SpecialBound 0x1.62b7d369a5aa9p+9 /* ln(√2 * (2^1023) ~ 709.44. */ + +/* Value of n above which scale overflows even with special treatment. */ +#define ScaleBound 0x1.4p+10 /* 1280.0. */ static const struct data { - struct v_expm1_data d; - float64x2_t oflow_bound; + struct v_exp_special_data special_data; + double c1, c3, c5, c7; + double c9, c10, ln2[2]; + float64x2_t c2, c4, c6, c8; + float64x2_t invln2; + int64x2_t exponent_bias; + float64x2_t special_bound, scale_thresh; } data = { - .d = V_EXPM1_DATA, - /* Value above which expm1(x) should overflow. Absolute value of the - underflow bound is greater than this, so it catches both cases - there is - a small window where fallbacks are triggered unnecessarily. */ - .oflow_bound = V2 (0x1.62b7d369a5aa9p+9), + /* Special case data from helper. */ + .special_data = V_EXP_SPECIAL_DATA, + .c1 = 0x1.5555555555559p-3, + .c2 = V2 (0x1.555555555554bp-5), + .c3 = 0x1.111111110f663p-7, + .c4 = V2 (0x1.6c16c16c1b5f3p-10), + .c5 = 0x1.a01a01affa35dp-13, + .c6 = V2 (0x1.a01a018b4ecbbp-16), + .c7 = 0x1.71ddf82db5bb4p-19, + .c8 = V2 (0x1.27e517fc0d54bp-22), + .c9 = 0x1.af5eedae67435p-26, + .c10 = 0x1.1f143d060a28ap-29, + .ln2 = { 0x1.62e42fefa39efp-1, 0x1.abc9e3b39803fp-56 }, + .invln2 = V2 (0x1.71547652b82fep0), + .exponent_bias = V2 (0x3ff0000000000000), + /* Implementation triggers special case handling as soon as the scale + overflows, which is earlier than expm1's overflow bound + `ln(DBL_MAX) ≈ 709.78` or underflow bound + `ln(DBL_TRUE_MIN) ≈ -744.44`. + The absolute comparison catches all these cases efficiently, although + there is a small window where special cases are triggered + unnecessarily. */ + .special_bound = V2 (SpecialBound), + .scale_thresh = V2 (ScaleBound), }; -static float64x2_t VPCS_ATTR NOINLINE -special_case (float64x2_t x, uint64x2_t special, const struct data *d) -{ - return v_call_f64 (expm1, x, expm1_inline (x, &d->d), special); -} - /* Double-precision vector exp(x) - 1 function. - The maximum error observed error is 2.05 ULP: + The maximum error observed error is 1.55 +0.5 ULP: _ZGVnN2v_expm1(0x1.6329669eb8c87p-2) got 0x1.a8897eef87b34p-2 want 0x1.a8897eef87b32p-2. */ float64x2_t VPCS_ATTR V_NAME_D1 (expm1) (float64x2_t x) { const struct data *d = ptr_barrier (&data); - /* Large input, NaNs and Infs. */ - uint64x2_t special = vcageq_f64 (x, d->oflow_bound); - if (__glibc_unlikely (v_any_u64 (special))) - return special_case (x, special, d); + /* Helper routine for calculating exp(x) - 1. */ + + float64x2_t ln2 = vld1q_f64 (&d->ln2[0]); + + /* Reduce argument to smaller range: + Let i = round(x / ln2) + and f = x - i * ln2, then f is in [-ln2/2, ln2/2]. + exp(x) - 1 = 2^i * (expm1(f) + 1) - 1 + where 2^i is exact because i is an integer. */ + float64x2_t n = vrndaq_f64 (vmulq_f64 (x, d->invln2)); + int64x2_t i = vcvtq_s64_f64 (n); + float64x2_t f = vfmsq_laneq_f64 (x, n, ln2, 0); + f = vfmsq_laneq_f64 (f, n, ln2, 1); + + /* Approximate expm1(f) using polynomial. + Taylor expansion for expm1(x) has the form: + x + ax^2 + bx^3 + cx^4 .... + So we calculate the polynomial P(f) = a + bf + cf^2 + ... + and assemble the approximation expm1(f) ~= f + f^2 * P(f). */ + float64x2_t f2 = vmulq_f64 (f, f); + float64x2_t f4 = vmulq_f64 (f2, f2); + float64x2_t lane_consts_13 = vld1q_f64 (&d->c1); + float64x2_t lane_consts_57 = vld1q_f64 (&d->c5); + float64x2_t lane_consts_910 = vld1q_f64 (&d->c9); + float64x2_t p01 = vfmaq_laneq_f64 (v_f64 (0.5), f, lane_consts_13, 0); + float64x2_t p23 = vfmaq_laneq_f64 (d->c2, f, lane_consts_13, 1); + float64x2_t p45 = vfmaq_laneq_f64 (d->c4, f, lane_consts_57, 0); + float64x2_t p67 = vfmaq_laneq_f64 (d->c6, f, lane_consts_57, 1); + float64x2_t p03 = vfmaq_f64 (p01, f2, p23); + float64x2_t p47 = vfmaq_f64 (p45, f2, p67); + float64x2_t p89 = vfmaq_laneq_f64 (d->c8, f, lane_consts_910, 0); + float64x2_t poly = vfmaq_laneq_f64 (p89, f2, lane_consts_910, 1); + poly = vfmaq_f64 (p47, f4, poly); + poly = vfmaq_f64 (p03, f4, poly); + + poly = vfmaq_f64 (f, f2, poly); + + /* Assemble the result. + expm1(x) ~= 2^i * (poly + 1) - 1 + Let scale = 2^i. */ + int64x2_t u = vaddq_s64 (vshlq_n_s64 (i, 52), d->exponent_bias); + float64x2_t scale = vreinterpretq_f64_s64 (u); + + float64x2_t y = vfmaq_f64 (vsubq_f64 (scale, v_f64 (1.0)), poly, scale); + + uint64x2_t cmp = vcageq_f64 (x, d->special_bound); - /* expm1(x) ~= p * t + (t - 1). */ - return expm1_inline (x, &d->d); + /* Fallback to special case for lanes with overflow. */ + if (__glibc_unlikely (v_any_u64 (cmp))) + { + float64x2_t special + = (exp_special (poly, n, scale, d->scale_thresh, &d->special_data)); + float64x2_t specialm1 = vsubq_f64 (special, v_f64 (1.0f)); + return vbslq_f64 (cmp, specialm1, y); + } + return y; } diff --git a/sysdeps/aarch64/fpu/expm1_sve.c b/sysdeps/aarch64/fpu/expm1_sve.c index 1ff4b9a593d..52abd8af1b9 100644 --- a/sysdeps/aarch64/fpu/expm1_sve.c +++ b/sysdeps/aarch64/fpu/expm1_sve.c @@ -18,36 +18,42 @@ . */ #include "sv_math.h" +#include "sv_exp_special_inline.h" + +/* Value of |x| above which scale overflows without special treatment. + ln(2^(1023 + 1/128)) ~ 709.10. */ +#define SpecialBound 0x1.628c2855bfaddp+9 #define FexpaBound 0x1.4cb5ecef28adap-3 /* 15*ln2/64. */ -#define SpecialBound 0x1.628c2855bfaddp+9 /* ln(2^(1023 + 1/128)). */ static const struct data { - double c2, c4; - double inv_ln2; + struct sv_exp_special_data special_data; double ln2_hi, ln2_lo; - double c0, c1, c3; - double shift, thres; + double c2, c4; + double inv_ln2, c1, c3, shift; + double special_bound, fexpa_bound; uint64_t expm1_data[32]; } data = { + .special_data = SV_EXP_SPECIAL_DATA, /* Table emulating FEXPA - 1, for values of FEXPA close to 1. The table holds values of 2^(i/64) - 1, computed in arbitrary precision. The first half of the table stores values associated to i from 0 to 15. The second half of the table stores values associated to i from 0 to -15. */ - .expm1_data = { - 0x0000000000000000, 0x3f864d1f3bc03077, 0x3f966c34c5615d0f, 0x3fa0e8a30eb37901, - 0x3fa6ab0d9f3121ec, 0x3fac7d865a7a3440, 0x3fb1301d0125b50a, 0x3fb429aaea92ddfb, - 0x3fb72b83c7d517ae, 0x3fba35beb6fcb754, 0x3fbd4873168b9aa8, 0x3fc031dc431466b2, - 0x3fc1c3d373ab11c3, 0x3fc35a2b2f13e6e9, 0x3fc4f4efa8fef709, 0x3fc6942d3720185a, - 0x0000000000000000, 0xbfc331751ec3a814, 0xbfc20224341286e4, 0xbfc0cf85bed0f8b7, - 0xbfbf332113d56b1f, 0xbfbcc0768d4175a6, 0xbfba46f918837cb7, 0xbfb7c695afc3b424, - 0xbfb53f391822dbc7, 0xbfb2b0cfe1266bd4, 0xbfb01b466423250a, 0xbfaafd11874c009e, - 0xbfa5b505d5b6f268, 0xbfa05e4119ea5d89, 0xbf95f134923757f3, 0xbf860f9f985bc9f4, - }, + .expm1_data = { 0x0000000000000000, 0x3f864d1f3bc03077, 0x3f966c34c5615d0f, + 0x3fa0e8a30eb37901, 0x3fa6ab0d9f3121ec, 0x3fac7d865a7a3440, + 0x3fb1301d0125b50a, 0x3fb429aaea92ddfb, 0x3fb72b83c7d517ae, + 0x3fba35beb6fcb754, 0x3fbd4873168b9aa8, 0x3fc031dc431466b2, + 0x3fc1c3d373ab11c3, 0x3fc35a2b2f13e6e9, 0x3fc4f4efa8fef709, + 0x3fc6942d3720185a, 0x0000000000000000, 0xbfc331751ec3a814, + 0xbfc20224341286e4, 0xbfc0cf85bed0f8b7, 0xbfbf332113d56b1f, + 0xbfbcc0768d4175a6, 0xbfba46f918837cb7, 0xbfb7c695afc3b424, + 0xbfb53f391822dbc7, 0xbfb2b0cfe1266bd4, 0xbfb01b466423250a, + 0xbfaafd11874c009e, 0xbfa5b505d5b6f268, 0xbfa05e4119ea5d89, + 0xbf95f134923757f3, 0xbf860f9f985bc9f4 }, /* Generated using Remez, in [-log(2)/128, log(2)/128]. */ - .c0 = 0x1p-1, + /* c0 = 0x1p-1 = 0.5. Passed directly as 0.5. */ .c1 = 0x1.55555555548f9p-3, .c2 = 0x1.5555555554c22p-5, .c3 = 0x1.111123aaa2fb2p-7, @@ -56,44 +62,26 @@ static const struct data .ln2_lo = 0x1.ef35793c76730p-45, .inv_ln2 = 0x1.71547652b82fep+0, .shift = 0x1.800000000ffc0p+46, /* 1.5*2^46+1023. */ - .thres = SpecialBound, + .special_bound = SpecialBound, + .fexpa_bound = FexpaBound, }; -#define SpecialOffset 0x6000000000000000 /* 0x1p513. */ -/* SpecialBias1 + SpecialBias1 = asuint(1.0). */ -#define SpecialBias1 0x7000000000000000 /* 0x1p769. */ -#define SpecialBias2 0x3010000000000000 /* 0x1p-254. */ - -static NOINLINE svfloat64_t -special_case (svbool_t pg, svfloat64_t y, svfloat64_t s, svfloat64_t p, - svfloat64_t n) +static svfloat64_t NOINLINE +special_m1 (svbool_t special, svfloat64_t y, svfloat64_t z, svfloat64_t scale, + svfloat64_t poly, svfloat64_t n, + const struct sv_exp_special_data *ds) { - /* s=2^n may overflow, break it up into s=s1*s2, - such that exp = s + s*y can be computed as s1*(s2+s2*y) - and s1*s1 overflows only if n>0. */ - - /* If n<=0 then set b to 0x6, 0 otherwise. */ - svbool_t p_sign = svcmple (pg, n, 0.0); /* n <= 0. */ - svuint64_t b - = svdup_u64_z (p_sign, SpecialOffset); /* Inactive lanes set to 0. */ - - /* Set s1 to generate overflow depending on sign of exponent n, - ie. s1 = 0x70...0 - b. */ - svfloat64_t s1 = svreinterpret_f64 (svsubr_x (pg, b, SpecialBias1)); - /* Offset s to avoid overflow in final result if n is below threshold. - ie. s2 = as_u64 (s) - 0x3010...0 + b. */ - svfloat64_t s2 = svreinterpret_f64 ( - svadd_x (pg, svsub_x (pg, svreinterpret_u64 (s), SpecialBias2), b)); - - /* |n| > 1280 => 2^(n) overflows. */ - svbool_t p_cmp = svacgt (pg, n, 1280.0); - - svfloat64_t r1 = svmul_x (svptrue_b64 (), s1, s1); - svfloat64_t r2 = svmla_x (pg, s2, s2, p); - svfloat64_t r0 = svmul_x (svptrue_b64 (), r2, s1); - - svbool_t is_safe = svacle (pg, n, 1023); /* Only correct special lanes. */ - return svsel (is_safe, y, svsub_x (pg, svsel (p_cmp, r1, r0), 1.0)); + /* FEXPA zeroes the sign bit, however the sign is meaningful to the + special case function so needs to be copied. + e = sign bit of u << 46. */ + svuint64_t u = svreinterpret_u64 (z); + svuint64_t e = svand_x (svptrue_b64 (), svlsl_x (svptrue_b64 (), u, 46), + 0x8000000000000000); + /* Copy sign to scale. */ + scale = svreinterpret_f64 ( + svadd_x (svptrue_b64 (), e, svreinterpret_u64 (scale))); + svfloat64_t special_result = special_case (scale, poly, n, ds); + return svsel (special, svsub_x (svptrue_b64 (), special_result, 1.0), y); } /* FEXPA based SVE expm1 algorithm. @@ -104,8 +92,6 @@ svfloat64_t SV_NAME_D1 (expm1) (svfloat64_t x, svbool_t pg) { const struct data *d = ptr_barrier (&data); - svbool_t special = svacgt (pg, x, d->thres); - svfloat64_t z = svmla_x (pg, sv_f64 (d->shift), x, d->inv_ln2); svuint64_t u = svreinterpret_u64 (z); svfloat64_t n = svsub_x (pg, z, d->shift); @@ -120,12 +106,12 @@ svfloat64_t SV_NAME_D1 (expm1) (svfloat64_t x, svbool_t pg) svfloat64_t r2 = svmul_x (svptrue_b64 (), r, r); svfloat64_t c24 = svld1rq (svptrue_b64 (), &d->c2); - svfloat64_t p; + svfloat64_t poly; svfloat64_t c12 = svmla_lane (sv_f64 (d->c1), r, c24, 0); svfloat64_t c34 = svmla_lane (sv_f64 (d->c3), r, c24, 1); - p = svmad_x (pg, c34, r2, c12); - p = svmad_x (pg, p, r, sv_f64 (d->c0)); - p = svmad_x (pg, p, r2, r); + poly = svmad_x (pg, c34, r2, c12); + poly = svmad_x (pg, poly, r, sv_f64 (0.5)); + poly = svmad_x (pg, poly, r2, r); svfloat64_t scale = svexpa (u); svfloat64_t scalem1 = svsub_x (pg, scale, sv_f64 (1.0)); @@ -140,7 +126,8 @@ svfloat64_t SV_NAME_D1 (expm1) (svfloat64_t x, svbool_t pg) This bound is based upon the table size: Bound = (TableSize-1/64) * ln2. The current bound is based upon a table size of 16. */ - svbool_t is_small = svaclt (pg, x, FexpaBound); + + svbool_t is_small = svaclt (pg, x, d->fexpa_bound); if (svptest_any (pg, is_small)) { @@ -162,20 +149,15 @@ svfloat64_t SV_NAME_D1 (expm1) (svfloat64_t x, svbool_t pg) scalem1 = svsel (is_small, lookup, scalem1); } - svfloat64_t y = svmla_x (pg, scalem1, scale, p); - + /* Fallback to special case for lanes with overflow. */ + svbool_t special = svacgt (svptrue_b64 (), x, d->special_bound); /* FEXPA returns nan for large inputs so we special case those. */ - if (__glibc_unlikely (svptest_any (pg, special))) + if (__glibc_unlikely (svptest_any (special, special))) { - /* FEXPA zeroes the sign bit, however the sign is meaningful to the - special case function so needs to be copied. - e = sign bit of u << 46. */ - svuint64_t e = svand_x (pg, svlsl_x (pg, u, 46), 0x8000000000000000); - /* Copy sign to s. */ - scale = svreinterpret_f64 (svadd_x (pg, e, svreinterpret_u64 (scale))); - return special_case (pg, y, scale, p, n); + svfloat64_t y = svmla_x (svptrue_b64 (), scalem1, scale, poly); + return special_m1 (special, y, z, scale, poly, n, &d->special_data); } /* return expm1 = (scale - 1) + (scale * poly). */ - return y; + return svmla_x (pg, scalem1, scale, poly); } diff --git a/sysdeps/aarch64/fpu/expm1f_advsimd.c b/sysdeps/aarch64/fpu/expm1f_advsimd.c index ba5558f2b7c..c652e66b9ed 100644 --- a/sysdeps/aarch64/fpu/expm1f_advsimd.c +++ b/sysdeps/aarch64/fpu/expm1f_advsimd.c @@ -18,26 +18,41 @@ . */ #include "v_math.h" -#include "v_expm1f_inline.h" +#include "v_expf_special_inline.h" + +/* Value of |x| above which scale overflows without special treatment. */ +#define SpecialBound 0x1.4d814bbe4473dp+6 /* ≈ 88.3763. */ static const struct data { - struct v_expm1f_data d; - float32x4_t oflow_bound; + struct v_expf_special_data special_data; + uint32x4_t exponent_bias; + float32x4_t c0, c2; + float c1, c3, inv_ln2, c4; + float ln2_hi, ln2_lo; + float32x4_t special_bound; } data = { - .d = V_EXPM1F_DATA, - /* Value above which expm1f(x) should overflow. Absolute value of the - underflow bound is greater than this, so it catches both cases - there is - a small window where fallbacks are triggered unnecessarily. */ - .oflow_bound = V4 (0x1.5ebc4p+6), + .special_data = V_EXPF_SPECIAL_DATA, + /* Coefficients generated using fpminimax with degree=5 in [-log(2)/2, + log(2)/2]. Exponent bias is asuint(1.0f). */ + .c0 = V4 (0x1.fffffep-2), + .c1 = 0x1.5554aep-3, + .c2 = V4 (0x1.555736p-5), + .c3 = 0x1.12287cp-7, + .c4 = 0x1.6b55a2p-10, + .exponent_bias = V4 (0x3f800000), + .inv_ln2 = 0x1.715476p+0f, + .ln2_hi = 0x1.62e4p-1f, + .ln2_lo = 0x1.7f7d1cp-20f, + /* Implementation triggers special case handling as soon as the scale + overflows, which is earlier than expm1f's overflow bound + `log1p(FLT_MAX)~88.7` or underflow bound `log1p(FLT_MIN)~-103.28`. + The absolute comparison catches all these cases efficiently, although + there is a small window where special cases are triggered + unnecessarily. */ + .special_bound = V4 (SpecialBound), }; -static float32x4_t VPCS_ATTR NOINLINE -special_case (float32x4_t x, uint32x4_t special, const struct data *d) -{ - return v_call_f32 (expm1f, x, expm1f_inline (x, &d->d), special); -} - /* Single-precision vector exp(x) - 1 function. The maximum error is 1.62 ULP: _ZGVnN4v_expm1f(0x1.85f83p-2) got 0x1.da9f4p-2 @@ -46,14 +61,44 @@ float32x4_t VPCS_ATTR NOINLINE V_NAME_F1 (expm1) (float32x4_t x) { const struct data *d = ptr_barrier (&data); + float32x4_t lane_consts = vld1q_f32 (&d->c1); + + /* exp(x) = 2^n * e^r = 2^n * (1 + poly (r)), + with 1 + poly(r) in [1/sqrt(2), sqrt(2)] and + x = r + n * ln(2), with r in [-ln(2)/2, ln(2)/2]. */ + float32x4_t n = vrndaq_f32 (vmulq_laneq_f32 (x, lane_consts, 2)); + float32x2_t ln2 = vld1_f32 (&d->ln2_hi); + float32x4_t r = vfmsq_lane_f32 (x, n, ln2, 0); + r = vfmsq_lane_f32 (r, n, ln2, 1); + + /* scale = 2^i. */ + uint32x4_t e = vshlq_n_u32 (vreinterpretq_u32_s32 (vcvtaq_s32_f32 (n)), 23); + float32x4_t scale = vreinterpretq_f32_u32 (vaddq_u32 (e, d->exponent_bias)); + /* Handles very large values (+ve and -ve), +/-NaN, +/-Inf. */ - uint32x4_t special = vcagtq_f32 (x, d->oflow_bound); + uint32x4_t cmp = vcageq_f32 (x, d->special_bound); + + /* Approximate expm1(r) with polynomial P, expm1(r) ~= r + r^2 * P(r). */ + float32x4_t r2 = vmulq_f32 (r, r); + float32x4_t r4 = vmulq_f32 (r2, r2); + float32x4_t p01 = vfmaq_laneq_f32 (d->c0, r, lane_consts, 0); + float32x4_t p23 = vfmaq_laneq_f32 (d->c2, r, lane_consts, 1); + float32x4_t poly = vfmaq_f32 (p01, r2, p23); + poly = vfmaq_laneq_f32 (poly, r4, lane_consts, 3); + poly = vfmaq_f32 (r, r2, poly); - if (__glibc_unlikely (v_any_u32 (special))) - return special_case (x, special, d); + /* expm1(x) ~= p * scale + (scale - 1). */ + float32x4_t y = vfmaq_f32 (vsubq_f32 (scale, v_f32 (1.0f)), poly, scale); - /* expm1(x) ~= p * t + (t - 1). */ - return expm1f_inline (x, &d->d); + /* Fallback to special case for lanes with overflow. */ + if (__glibc_unlikely (v_any_u32 (cmp))) + { + float32x4_t special + = (expf_special (poly, n, e, cmp, scale, &d->special_data)); + float32x4_t specialm1 = vsubq_f32 (special, v_f32 (1.0f)); + return vbslq_f32 (cmp, specialm1, y); + } + return y; } libmvec_hidden_def (V_NAME_F1 (expm1)) HALF_WIDTH_ALIAS_F1 (expm1) diff --git a/sysdeps/aarch64/fpu/expm1f_sve.c b/sysdeps/aarch64/fpu/expm1f_sve.c index b8244a3a649..07e13556a2d 100644 --- a/sysdeps/aarch64/fpu/expm1f_sve.c +++ b/sysdeps/aarch64/fpu/expm1f_sve.c @@ -18,77 +18,80 @@ . */ #include "sv_math.h" +#include "sv_expf_special_inline.h" -/* Largest value of x for which expm1(x) should round to -1. */ -#define SpecialBound 0x1.5ebc4p+6f +/* Value of |x| above which scale overflows without special treatment. + ln(2^(127 + 0.5)) ~ 88.38. */ +#define SpecialBound 0x1.61814bbe4473dp+6 static const struct data { /* These 4 are grouped together so they can be loaded as one quadword, then used with _lane forms of svmla/svmls. */ - float c2, c4, ln2_hi, ln2_lo; - float c0, inv_ln2, c1, c3, special_bound; + struct sv_expf_special_data special_data; + float c0, c1, c3, inv_ln2; + float ln2_hi, ln2_lo, c2, c4; + float special_bound; } data = { + .special_data = SV_EXPF_SPECIAL_DATA, /* Generated using fpminimax. */ - .c0 = 0x1.fffffep-2, .c1 = 0x1.5554aep-3, - .c2 = 0x1.555736p-5, .c3 = 0x1.12287cp-7, - .c4 = 0x1.6b55a2p-10, .inv_ln2 = 0x1.715476p+0f, - .special_bound = SpecialBound, .ln2_lo = 0x1.7f7d1cp-20f, .ln2_hi = 0x1.62e4p-1f, - + .ln2_lo = 0x1.7f7d1cp-20f, + .c0 = 0x1.fffffep-2, + .c1 = 0x1.5554aep-3, + .c2 = 0x1.555736p-5, + .c3 = 0x1.12287cp-7, + .c4 = 0x1.6b55a2p-10, + .inv_ln2 = 0x1.715476p+0f, + .special_bound = SpecialBound, }; -static svfloat32_t NOINLINE -special_case (svfloat32_t x, svbool_t pg) -{ - return sv_call_f32 (expm1f, x, x, pg); -} - -/* Single-precision SVE exp(x) - 1. Maximum error is 1.52 ULP: +/* Single-precision SVE exp(x) - 1. + Maximum error is 1.02 +0.5ULP: _ZGVsMxv_expm1f(0x1.8f4ebcp-2) got 0x1.e859dp-2 want 0x1.e859d4p-2. */ -svfloat32_t SV_NAME_F1 (expm1) (svfloat32_t x, svbool_t pg) +svfloat32_t SV_NAME_F1 (expm1) (svfloat32_t x, const svbool_t pg) { const struct data *d = ptr_barrier (&data); - /* Large, NaN/Inf. */ - svbool_t special = svnot_z (pg, svaclt (pg, x, d->special_bound)); - - if (__glibc_unlikely (svptest_any (pg, special))) - return special_case (x, pg); - /* This vector is reliant on layout of data - it contains constants that can be used with _lane forms of svmla/svmls. Values are: - [ coeff_2, coeff_4, ln2_hi, ln2_lo ]. */ - svfloat32_t lane_constants = svld1rq (svptrue_b32 (), &d->c2); + [ ln2_hi, ln2_lo, coeff_2, coeff_4 ]. */ + svfloat32_t lane_constants = svld1rq (svptrue_b32 (), &d->ln2_hi); /* Reduce argument to smaller range: Let i = round(x / ln2) and f = x - i * ln2, then f is in [-ln2/2, ln2/2]. exp(x) - 1 = 2^i * (expm1(f) + 1) - 1 where 2^i is exact because i is an integer. */ - svfloat32_t j = svmul_x (svptrue_b32 (), x, d->inv_ln2); - j = svrinta_x (pg, j); + svfloat32_t n = svrinta_x (pg, svmul_x (svptrue_b32 (), x, d->inv_ln2)); + + svfloat32_t f = svmls_lane_f32 (x, n, lane_constants, 0); + f = svmls_lane_f32 (f, n, lane_constants, 1); - svfloat32_t f = svmls_lane (x, j, lane_constants, 2); - f = svmls_lane (f, j, lane_constants, 3); + /* Assemble the result. + expm1(x) ~= scale * (poly + 1) - 1 + with scale = 2^i. */ + svfloat32_t scale = svscale_x (pg, sv_f32 (1.0f), svcvt_s32_x (pg, n)); /* Approximate expm1(f) using polynomial. Taylor expansion for expm1(x) has the form: - x + ax^2 + bx^3 + cx^4 .... + x + ax^2 + bx^3 + cx^4 .... So we calculate the polynomial P(f) = a + bf + cf^2 + ... and assemble the approximation expm1(f) ~= f + f^2 * P(f). */ - svfloat32_t p12 = svmla_lane (sv_f32 (d->c1), f, lane_constants, 0); - svfloat32_t p34 = svmla_lane (sv_f32 (d->c3), f, lane_constants, 1); svfloat32_t f2 = svmul_x (svptrue_b32 (), f, f); - svfloat32_t p = svmla_x (pg, p12, f2, p34); + svfloat32_t p12 = svmla_lane (sv_f32 (d->c1), f, lane_constants, 2); + svfloat32_t p34 = svmla_lane (sv_f32 (d->c3), f, lane_constants, 3); - p = svmla_x (pg, sv_f32 (d->c0), f, p); - p = svmla_x (pg, f, f2, p); + svfloat32_t poly = svmla_x (pg, p12, f2, p34); + poly = svmla_x (pg, sv_f32 (d->c0), f, poly); + poly = svmla_x (pg, f, f2, poly); - /* Assemble the result. - expm1(x) ~= 2^i * (p + 1) - 1 - Let t = 2^i. */ - svfloat32_t t = svscale_x (pg, sv_f32 (1.0f), svcvt_s32_x (pg, j)); - return svmla_x (pg, svsub_x (pg, t, 1.0f), p, t); + /* Large, NaN/Inf. */ + svbool_t cmp = svacge_n_f32 (svptrue_b32 (), x, d->special_bound); + /* Fallback to special case for lanes with overflow. */ + if (__glibc_unlikely (svptest_any (cmp, cmp))) + return special_case (poly, n, scale, cmp, &d->special_data); + + return svmla_x (pg, svsub_x (pg, scale, 1.0f), poly, scale); } diff --git a/sysdeps/aarch64/fpu/sv_exp_special_inline.h b/sysdeps/aarch64/fpu/sv_exp_special_inline.h new file mode 100644 index 00000000000..aea7c174373 --- /dev/null +++ b/sysdeps/aarch64/fpu/sv_exp_special_inline.h @@ -0,0 +1,67 @@ +/* SVE helper for double-precision expB special routines + + Copyright (C) 2026 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + . */ + +#ifndef AARCH64_FPU_SV_EXP_SPECIAL_INLINE_H +#define AARCH64_FPU_SV_EXP_SPECIAL_INLINE_H + +#include "sv_math.h" + +static const struct sv_exp_special_data +{ + uint64_t special_offset, special_bias1, special_bias2; +} SV_EXP_SPECIAL_DATA = { + .special_offset = 0x6000000000000000, + .special_bias1 = 0x7000000000000000, /* 0x1p769. */ + .special_bias2 = 0x3010000000000000, /* 0x1p-254. */ +}; + +static inline svfloat64_t +special_case (svfloat64_t scale, svfloat64_t poly, svfloat64_t n, + const struct sv_exp_special_data *ds) +{ + /* scale = 2^n may overflow, break it up into s=s1*s2, + such that exp = scale + scale*poly can be computed as s1*(s2+s2*poly) + and s1*s1 overflows only if n>0. */ + + /* If n<=0 then set b to 0x6, 0 otherwise. */ + svbool_t p_sign = svcmple (svptrue_b64 (), n, 0.0); /* n <= 0. */ + svuint64_t b = svdup_u64_z ( + p_sign, ds->special_offset); /* Inactive lanes set to 0. */ + /* Set s1 to generate overflow depending on sign of exponent n, + ie. s1 = 0x70...0 - b. */ + svfloat64_t s1 + = svreinterpret_f64 (svsubr_x (svptrue_b64 (), b, ds->special_bias1)); + /* Offset s to avoid overflow in final result if n is below threshold. + ie. s2 = as_u64 (s) - 0x3010...0 + b. */ + svuint64_t biased_scale + = svsub_x (svptrue_b64 (), svreinterpret_u64 (scale), ds->special_bias2); + svfloat64_t s2 + = svreinterpret_f64 (svadd_x (svptrue_b64 (), biased_scale, b)); + + /* |n| > 1280 => 2^(n) overflows. */ + svbool_t p_cmp = svacgt (svptrue_b64 (), n, 1280); + + svfloat64_t r1 = svmul_x (svptrue_b64 (), s1, s1); + svfloat64_t r2 = svmla_x (svptrue_b64 (), s2, s2, poly); + svfloat64_t r0 = svmul_x (svptrue_b64 (), r2, s1); + + return svsel (p_cmp, r1, r0); +} + +#endif diff --git a/sysdeps/aarch64/fpu/sv_expf_special_inline.h b/sysdeps/aarch64/fpu/sv_expf_special_inline.h new file mode 100644 index 00000000000..83b36e3a249 --- /dev/null +++ b/sysdeps/aarch64/fpu/sv_expf_special_inline.h @@ -0,0 +1,72 @@ +/* SVE helper for single-precision expBm1 special routines + + Copyright (C) 2026 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + . */ + +#ifndef AARCH64_FPU_SV_EXPF_SPECIAL_INLINE_H +#define AARCH64_FPU_SV_EXPF_SPECIAL_INLINE_H + +#include "sv_math.h" + +static const struct sv_expf_special_data +{ + uint32_t exponent_bias, special_offset, special_bias; + float32_t scale_thresh; +} SV_EXPF_SPECIAL_DATA = { + .special_offset = 0x82000000, + .special_bias = 0x7f000000, +}; + +/* Special case routine shared with other expBm1 routines. */ +static inline svfloat32_t +special_exp (svfloat32_t poly, svfloat32_t n, svuint32_t e, svbool_t cmp1, + svfloat32_t scale, const struct sv_expf_special_data *ds) +{ + svbool_t b = svcmple (svptrue_b32 (), n, 0.0f); + svfloat32_t s1 = svreinterpret_f32 ( + svsel (b, sv_u32 (ds->special_offset + ds->special_bias), + sv_u32 (ds->special_bias))); + svfloat32_t s2 + = svreinterpret_f32 (svsub_m (b, e, sv_u32 (ds->special_offset))); + /* Value of n above which scale overflows even with special treatment. */ + svbool_t cmp2 = svacgt (svptrue_b32 (), n, 192.0); + svfloat32_t r2 = svmul_x (svptrue_b32 (), s1, s1); + svfloat32_t r1 + = svmul_x (svptrue_b32 (), svmla_x (svptrue_b32 (), s2, poly, s2), s1); + svfloat32_t r0 = svmla_x (svptrue_b32 (), scale, poly, scale); + svfloat32_t r = svsel (cmp1, r1, r0); + return svsel (cmp2, r2, r); +} + +/* Special case routine for expBm1. */ +static svfloat32_t NOINLINE +special_case (svfloat32_t poly, svfloat32_t n, svfloat32_t scale, + svbool_t cmp1, const struct sv_expf_special_data *ds) +{ + /* Compute unbiased exponent of scale. */ + svuint32_t e = svlsl_x ( + svptrue_b32 (), svreinterpret_u32 (svcvt_s32_x (svptrue_b32 (), n)), 23); + /* compute special exp and subtract 1. */ + svfloat32_t special = svsub_x ( + svptrue_b32 (), special_exp (poly, n, e, cmp1, scale, ds), 1.0f); + /* compute non-special output. */ + svfloat32_t y = svmla_x (svptrue_b32 (), + svsub_x (svptrue_b32 (), scale, 1.0f), poly, scale); + return svsel_f32 (cmp1, special, y); +} + +#endif diff --git a/sysdeps/aarch64/fpu/v_exp_special_case_inline.h b/sysdeps/aarch64/fpu/v_exp_special_case_inline.h new file mode 100644 index 00000000000..8271ae3ef9b --- /dev/null +++ b/sysdeps/aarch64/fpu/v_exp_special_case_inline.h @@ -0,0 +1,49 @@ +/* Helper for double-precision AdvSIMD expB special cases + + Copyright (C) 2026 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + . */ + +#ifndef AARCH64_FPU_V_EXP_SPECIAL_INLINE_H +#define AARCH64_FPU_V_EXP_SPECIAL_INLINE_H + +#include "v_math.h" + +static const struct v_exp_special_data +{ + uint64x2_t special_offset, special_bias1, special_bias2; +} V_EXP_SPECIAL_DATA = { + .special_offset = V2 (0x6000000000000000), /* 0x1p513. */ + .special_bias1 = V2 (0x7000000000000000), /* 0x1p769. */ + .special_bias2 = V2 (0x3010000000000000), /* 0x1p-254. */ +}; + +static inline float64x2_t VPCS_ATTR +exp_special (float64x2_t poly, float64x2_t n, float64x2_t scale, + float64x2_t scale_bound, const struct v_exp_special_data *ds) +{ + /* 2^n may overflow, break it up into s1*s2. */ + uint64x2_t b = vandq_u64 (vclezq_f64 (n), ds->special_offset); + float64x2_t s1 = vreinterpretq_f64_u64 (vsubq_u64 (ds->special_bias1, b)); + float64x2_t s2 = vreinterpretq_f64_u64 (vaddq_u64 ( + vsubq_u64 (vreinterpretq_u64_f64 (scale), ds->special_bias2), b)); + uint64x2_t cmp2 = vcagtq_f64 (n, scale_bound); + float64x2_t r1 = vmulq_f64 (s1, s1); + float64x2_t r2 = vmulq_f64 (vfmaq_f64 (s2, s2, poly), s1); + return vbslq_f64 (cmp2, r1, r2); +} + +#endif diff --git a/sysdeps/aarch64/fpu/v_expf_special_inline.h b/sysdeps/aarch64/fpu/v_expf_special_inline.h new file mode 100644 index 00000000000..d2701476f84 --- /dev/null +++ b/sysdeps/aarch64/fpu/v_expf_special_inline.h @@ -0,0 +1,53 @@ +/* Helper for single-precision AdvSIMD expB special cases + + Copyright (C) 2026 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + . */ + +#ifndef AARCH64_FPU_V_EXPF_SPECIAL_INLINE_H +#define AARCH64_FPU_V_EXPF_SPECIAL_INLINE_H + +#include "v_math.h" + +static const struct v_expf_special_data +{ + uint32x4_t special_offset, special_bias; + float32x4_t scale_bound; +} V_EXPF_SPECIAL_DATA = { + .special_offset = V4 (0x82000000), + .special_bias = V4 (0x7f000000), + /* Value of n above which scale overflows even with special treatment. */ + .scale_bound = V4 (0x1.8p+7), /* 192.0f. */ +}; + +static inline float32x4_t VPCS_ATTR +expf_special (float32x4_t poly, float32x4_t n, uint32x4_t e, uint32x4_t cmp1, + float32x4_t scale, const struct v_expf_special_data *ds) +{ + /* 2^n may overflow, break it up into s1*s2. */ + uint32x4_t b = vandq_u32 (vclezq_f32 (n), ds->special_offset); + float32x4_t s1 = vreinterpretq_f32_u32 (vaddq_u32 (b, ds->special_bias)); + float32x4_t s2 = vreinterpretq_f32_u32 (vsubq_u32 (e, b)); + uint32x4_t cmp2 = vcagtq_f32 (n, ds->scale_bound); + float32x4_t r2 = vmulq_f32 (s1, s1); + float32x4_t r1 = vmulq_f32 (vfmaq_f32 (s2, poly, s2), s1); + /* Similar to r1 but avoids double rounding in the subnormal range. */ + float32x4_t r0 = vfmaq_f32 (scale, poly, scale); + float32x4_t r = vbslq_f32 (cmp1, r1, r0); + return vbslq_f32 (cmp2, r2, r); +} + +#endif