From patchwork Wed Dec 10 15:19:26 2025
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Pierre Blanchard
X-Patchwork-Id: 126334
Delivered-To: libc-alpha@sourceware.org
From: Pierre Blanchard
To: libc-alpha@sourceware.org
CC: Pierre Blanchard
Subject: [PATCH 1/4] AArch64: Improve AdvSIMD and SVE pow(f).
Date: Wed, 10 Dec 2025 15:19:26 +0000
Message-ID: <20251210151929.185631-1-pierre.blanchard@arm.com>
X-Mailer: git-send-email 2.34.1
List-Id: Libc-alpha mailing list
Optimize handling of subnormal x and/or negative x.

Some cleanup of attributes and macros, improving overall consistency.
Move the core computation to headers.
Introduce a config parameter to turn sign_bias on/off.

Performance improvement on V1 with GCC 15:
- AdvSIMD pow: 10-15% on subnormals.
- AdvSIMD powf: 30 to 70% on subnormals or x < 0, <= 3% on x > 0.
- SVE pow: 10-15% on subnormals, <= 3% otherwise.
- SVE powf: no significant variations in codegen/perf.
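For context, all four implementations rely on the same subnormal
normalization trick that this patch makes cheaper: multiply by 0x1p52 to
make the value normal, then rebias the exponent field down by 52. The
following standalone scalar sketch (not part of the patch; the helper
names are invented for the example) shows the three steps used in
pow_scalar_special_case:

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

static uint64_t
asuint64 (double x)
{
  uint64_t u;
  memcpy (&u, &x, sizeof u);
  return u;
}

static uint64_t
normalize_subnormal (double x)
{
  /* Scale up so the mantissa becomes explicit, clear the sign bit,
     then subtract 52 from the biased exponent field.  For very small
     subnormals the subtraction borrows into the sign bit; the
     table-based log consumes the representation "normalized in the
     subnormal range using the sign bit for the exponent".  */
  uint64_t ix = asuint64 (x * 0x1p52);
  ix &= 0x7fffffffffffffff;
  ix -= 52ULL << 52;
  return ix;
}

int
main (void)
{
  double x = 0x1p-1040; /* Subnormal: exponent field is 0.  */
  printf ("%016" PRIx64 "\n", normalize_subnormal (x));
  return 0;
}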
---
Ok for master? If so, please commit it for me, as I don't have commit
rights. Likewise, please say if it is OK to backport this optimization to
release branches as far back as 2.40 (the release that introduced AArch64
vector pow and powf).

Thanks,
Pierre

 sysdeps/aarch64/fpu/finite_pow.h     | 126 -----------
 sysdeps/aarch64/fpu/pow_advsimd.c    | 294 +++++++++++----------------
 sysdeps/aarch64/fpu/pow_sve.c        | 307 +--------------------------
 sysdeps/aarch64/fpu/powf_advsimd.c   | 274 ++++++++++--------------
 sysdeps/aarch64/fpu/powf_sve.c       | 193 +----------------
 sysdeps/aarch64/fpu/sv_pow_inline.h  | 304 ++++++++++++++++++++++++++
 sysdeps/aarch64/fpu/sv_powf_inline.h | 211 ++++++++++++++++++
 sysdeps/aarch64/fpu/v_pow_inline.h   | 193 +++++++++++++++++
 sysdeps/aarch64/fpu/v_powf_inline.h  | 116 ++++++++++
 sysdeps/aarch64/fpu/v_powrf_inline.h | 246 +++++++++++++++++++++
 10 files changed, 1303 insertions(+), 961 deletions(-)
 create mode 100644 sysdeps/aarch64/fpu/sv_pow_inline.h
 create mode 100644 sysdeps/aarch64/fpu/sv_powf_inline.h
 create mode 100644 sysdeps/aarch64/fpu/v_pow_inline.h
 create mode 100644 sysdeps/aarch64/fpu/v_powf_inline.h
 create mode 100644 sysdeps/aarch64/fpu/v_powrf_inline.h

diff --git a/sysdeps/aarch64/fpu/finite_pow.h b/sysdeps/aarch64/fpu/finite_pow.h
index 9fd535e77d..70e1c87751 100644
--- a/sysdeps/aarch64/fpu/finite_pow.h
+++ b/sysdeps/aarch64/fpu/finite_pow.h
@@ -19,8 +19,6 @@
 
 #include "math_config.h"
 
-/* Scalar version of pow used for fallbacks in vector implementations.  */
-
 /* Data is defined in v_pow_log_data.c.  */
 #define N_LOG (1 << V_POW_LOG_TABLE_BITS)
 #define Off 0x3fe6955500000000
@@ -181,49 +179,6 @@ exp_inline (double x, double xtail, uint32_t sign_bias)
   return scale + scale * tmp;
 }
 
-/* Computes exp(x+xtail) where |xtail| < 2^-8/N and |xtail| <= |x|.
-   A version of exp_inline that is not inlined and for which sign_bias is
-   equal to 0.  */
-static double NOINLINE
-exp_nosignbias (double x, double xtail)
-{
-  uint32_t abstop = top12 (x) & 0x7ff;
-  if (__glibc_unlikely (abstop - SmallExp >= ThresExp))
-    {
-      /* Avoid spurious underflow for tiny x.  */
-      if (abstop - SmallExp >= 0x80000000)
-        return 1.0;
-      /* Note: inf and nan are already handled.  */
-      if (abstop >= top12 (1024.0))
-        return asuint64 (x) >> 63 ? 0.0 : INFINITY;
-      /* Large x is special cased below.  */
-      abstop = 0;
-    }
-
-  /* exp(x) = 2^(k/N) * exp(r), with exp(r) in [2^(-1/2N),2^(1/2N)].  */
-  /* x = ln2/N*k + r, with k integer and r in [-ln2/2N, ln2/2N].  */
-  double z = InvLn2N * x;
-  double kd = round (z);
-  uint64_t ki = lround (z);
-  double r = x - kd * Ln2HiN - kd * Ln2LoN;
-  /* The code assumes 2^-200 < |xtail| < 2^-8/N.  */
-  r += xtail;
-  /* 2^(k/N) ~= scale.  */
-  uint64_t idx = ki & (N_EXP - 1);
-  uint64_t top = ki << (52 - V_POW_EXP_TABLE_BITS);
-  /* This is only a valid scale when -1023*N < k < 1024*N.  */
-  uint64_t sbits = SBits[idx] + top;
-  /* exp(x) = 2^(k/N) * exp(r) ~= scale + scale * (tail + exp(r) - 1).  */
-  double r2 = r * r;
-  double tmp = r + r2 * Cs[0] + r * r2 * (Cs[1] + r * Cs[2]);
-  if (__glibc_unlikely (abstop == 0))
-    return special_case (tmp, sbits, ki);
-  double scale = asdouble (sbits);
-  /* Note: tmp == 0 or |tmp| > 2^-200 and scale > 2^-739, so there
-     is no spurious underflow here even without fma.  */
-  return scale + scale * tmp;
-}
-
 /* Returns 0 if not int, 1 if odd int, 2 if even int.  The argument is
    the bit representation of a non-zero finite floating-point value.  */
 static inline int
@@ -247,84 +202,3 @@ zeroinfnan (uint64_t i)
 {
   return 2 * i - 1 >= 2 * asuint64 (INFINITY) - 1;
 }
-
-static double NOINLINE
-pow_scalar_special_case (double x, double y)
-{
-  uint32_t sign_bias = 0;
-  uint64_t ix, iy;
-  uint32_t topx, topy;
-
-  ix = asuint64 (x);
-  iy = asuint64 (y);
-  topx = top12 (x);
-  topy = top12 (y);
-  if (__glibc_unlikely (topx - SmallPowX >= ThresPowX
-                        || (topy & 0x7ff) - SmallPowY >= ThresPowY))
-    {
-      /* Note: if |y| > 1075 * ln2 * 2^53 ~= 0x1.749p62 then pow(x,y) = inf/0
-         and if |y| < 2^-54 / 1075 ~= 0x1.e7b6p-65 then pow(x,y) = +-1.  */
-      /* Special cases: (x < 0x1p-126 or inf or nan) or
-         (|y| < 0x1p-65 or |y| >= 0x1p63 or nan).  */
-      if (__glibc_unlikely (zeroinfnan (iy)))
-        {
-          if (2 * iy == 0)
-            return issignaling_inline (x) ? x + y : 1.0;
-          if (ix == asuint64 (1.0))
-            return issignaling_inline (y) ? x + y : 1.0;
-          if (2 * ix > 2 * asuint64 (INFINITY)
-              || 2 * iy > 2 * asuint64 (INFINITY))
-            return x + y;
-          if (2 * ix == 2 * asuint64 (1.0))
-            return 1.0;
-          if ((2 * ix < 2 * asuint64 (1.0)) == !(iy >> 63))
-            return 0.0; /* |x|<1 && y==inf or |x|>1 && y==-inf.  */
-          return y * y;
-        }
-      if (__glibc_unlikely (zeroinfnan (ix)))
-        {
-          double x2 = x * x;
-          if (ix >> 63 && checkint (iy) == 1)
-            {
-              x2 = -x2;
-              sign_bias = 1;
-            }
-          return iy >> 63 ? 1 / x2 : x2;
-        }
-      /* Here x and y are non-zero finite.  */
-      if (ix >> 63)
-        {
-          /* Finite x < 0.  */
-          int yint = checkint (iy);
-          if (yint == 0)
-            return __builtin_nan ("");
-          if (yint == 1)
-            sign_bias = SignBias;
-          ix &= 0x7fffffffffffffff;
-          topx &= 0x7ff;
-        }
-      if ((topy & 0x7ff) - SmallPowY >= ThresPowY)
-        {
-          /* Note: sign_bias == 0 here because y is not odd.  */
-          if (ix == asuint64 (1.0))
-            return 1.0;
-          /* |y| < 2^-65, x^y ~= 1 + y*log(x).  */
-          if ((topy & 0x7ff) < SmallPowY)
-            return 1.0;
-          return (ix > asuint64 (1.0)) == (topy < 0x800) ? INFINITY : 0;
-        }
-      if (topx == 0)
-        {
-          /* Normalize subnormal x so exponent becomes negative.  */
-          ix = asuint64 (x * 0x1p52);
-          ix &= 0x7fffffffffffffff;
-          ix -= 52ULL << 52;
-        }
-    }
-
-  double lo;
-  double hi = log_inline (ix, &lo);
-  double ehi = y * hi;
-  double elo = y * lo + fma (y, hi, -ehi);
-  return exp_inline (ehi, elo, sign_bias);
-}
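The zeroinfnan predicate kept in finite_pow.h above relies on unsigned
wraparound: doubling the representation discards the sign bit, and 2*i - 1
wraps i == 0 to UINT64_MAX, so one comparison classifies 0, +-Inf and NaN
together. A quick standalone check (illustrative only, not part of the
patch; asuint64 is redefined locally):

#include <assert.h>
#include <math.h>
#include <stdint.h>
#include <string.h>

static uint64_t
asuint64 (double x)
{
  uint64_t u;
  memcpy (&u, &x, sizeof u);
  return u;
}

static int
zeroinfnan (uint64_t i)
{
  return 2 * i - 1 >= 2 * asuint64 (INFINITY) - 1;
}

int
main (void)
{
  double special[] = { 0.0, -0.0, INFINITY, -INFINITY, NAN };
  double ordinary[] = { 1.0, -2.5, 0x1p-1074, -0x1.fffffffffffffp1023 };
  for (unsigned i = 0; i < sizeof special / sizeof special[0]; i++)
    assert (zeroinfnan (asuint64 (special[i])));
  for (unsigned i = 0; i < sizeof ordinary / sizeof ordinary[0]; i++)
    assert (!zeroinfnan (asuint64 (ordinary[i])));
  return 0;
}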
diff --git a/sysdeps/aarch64/fpu/pow_advsimd.c b/sysdeps/aarch64/fpu/pow_advsimd.c
index e1b3b20f31..1cfaef502b 100644
--- a/sysdeps/aarch64/fpu/pow_advsimd.c
+++ b/sysdeps/aarch64/fpu/pow_advsimd.c
@@ -19,186 +19,109 @@
 
 #include "v_math.h"
 
-/* Defines parameters of the approximation and scalar fallback.  */
-#include "finite_pow.h"
+#include "v_pow_inline.h"
 
-#define VecSmallPowX v_u64 (SmallPowX)
-#define VecThresPowX v_u64 (ThresPowX)
-#define VecSmallPowY v_u64 (SmallPowY)
-#define VecThresPowY v_u64 (ThresPowY)
-
-static const struct data
-{
-  uint64x2_t inf;
-  float64x2_t small_powx;
-  uint64x2_t offset, mask;
-  uint64x2_t mask_sub_0, mask_sub_1;
-  float64x2_t log_c0, log_c2, log_c4, log_c5;
-  double log_c1, log_c3;
-  double ln2_lo, ln2_hi;
-  uint64x2_t small_exp, thres_exp;
-  double ln2_lo_n, ln2_hi_n;
-  double inv_ln2_n, exp_c2;
-  float64x2_t exp_c0, exp_c1;
-} data = {
-  /* Power threshold.  */
-  .inf = V2 (0x7ff0000000000000),
-  .small_powx = V2 (0x1p-126),
-  .offset = V2 (Off),
-  .mask = V2 (0xfffULL << 52),
-  .mask_sub_0 = V2 (1ULL << 52),
-  .mask_sub_1 = V2 (52ULL << 52),
-  /* Coefficients copied from v_pow_log_data.c
-     relative error: 0x1.11922ap-70 in [-0x1.6bp-8, 0x1.6bp-8]
-     Coefficients are scaled to match the scaling during evaluation.  */
-  .log_c0 = V2 (0x1.555555555556p-2 * -2),
-  .log_c1 = -0x1.0000000000006p-2 * -2,
-  .log_c2 = V2 (0x1.999999959554ep-3 * 4),
-  .log_c3 = -0x1.555555529a47ap-3 * 4,
-  .log_c4 = V2 (0x1.2495b9b4845e9p-3 * -8),
-  .log_c5 = V2 (-0x1.0002b8b263fc3p-3 * -8),
-  .ln2_hi = 0x1.62e42fefa3800p-1,
-  .ln2_lo = 0x1.ef35793c76730p-45,
-  /* Polynomial coefficients: abs error: 1.43*2^-58, ulp error: 0.549
-     (0.550 without fma) if |x| < ln2/512.  */
-  .exp_c0 = V2 (0x1.fffffffffffd4p-2),
-  .exp_c1 = V2 (0x1.5555571d6ef9p-3),
-  .exp_c2 = 0x1.5555576a5adcep-5,
-  .small_exp = V2 (0x3c90000000000000),
-  .thres_exp = V2 (0x03f0000000000000),
-  .inv_ln2_n = 0x1.71547652b82fep8, /* N/ln2.  */
-  .ln2_hi_n = 0x1.62e42fefc0000p-9, /* ln2/N.  */
-  .ln2_lo_n = -0x1.c610ca86c3899p-45,
-};
-
-/* This version implements an algorithm close to scalar pow but
-   - does not implement the trick in the exp's specialcase subroutine to avoid
-   double-rounding,
-   - does not use a tail in the exponential core computation,
-   - and pow's exp polynomial order and table bits might differ.
-
-   Maximum measured error is 1.04 ULPs:
-   _ZGVnN2vv_pow(0x1.024a3e56b3c3p-136, 0x1.87910248b58acp-13)
-   got 0x1.f71162f473251p-1
-   want 0x1.f71162f473252p-1.  */
-
-static inline float64x2_t
-v_masked_lookup_f64 (const double *table, uint64x2_t i)
+static double NOINLINE
+pow_scalar_special_case (double x, double y)
 {
-  return (float64x2_t){
-    table[(i[0] >> (52 - V_POW_LOG_TABLE_BITS)) & (N_LOG - 1)],
-    table[(i[1] >> (52 - V_POW_LOG_TABLE_BITS)) & (N_LOG - 1)]
-  };
-}
+  uint32_t sign_bias = 0;
+  uint64_t ix, iy;
+  uint32_t topx, topy;
+
+  ix = asuint64 (x);
+  iy = asuint64 (y);
+  topx = top12 (x);
+  topy = top12 (y);
+  /* Special cases: (x < 0x1p-126 or inf or nan) or
+     (|y| < 0x1p-65 or |y| >= 0x1p63 or nan).  */
+  if (__glibc_unlikely (topx - SmallPowX >= ThresPowX
+                        || (topy & 0x7ff) - SmallPowY >= ThresPowY))
+    {
+      /* |y| is 0, Inf or NaN.  */
+      if (__glibc_unlikely (zeroinfnan (iy)))
+        {
+          if (2 * iy == 0)
+            return issignaling_inline (x) ? x + y : 1.0;
+          if (ix == asuint64 (1.0))
+            return issignaling_inline (y) ? x + y : 1.0;
+          if (2 * ix > 2 * asuint64 (INFINITY)
+              || 2 * iy > 2 * asuint64 (INFINITY))
+            return x + y;
+          if (2 * ix == 2 * asuint64 (1.0))
+            return 1.0;
+          if ((2 * ix < 2 * asuint64 (1.0)) == !(iy >> 63))
+            return 0.0; /* |x|<1 && y==inf or |x|>1 && y==-inf.  */
+          return y * y;
+        }
+      /* |x| is 0, Inf or NaN.  */
+      if (__glibc_unlikely (zeroinfnan (ix)))
+        {
+          double x2 = x * x;
+          if (ix >> 63 && checkint (iy) == 1)
+            {
+              x2 = -x2;
+              sign_bias = 1;
+            }
+          return iy >> 63 ? 1 / x2 : x2;
+        }
+      /* Here x and y are non-zero finite.  */
+      /* Finite negative x returns NaN unless y is integer.  */
+      if (ix >> 63)
+        {
+          /* Finite x < 0.  */
+          int yint = checkint (iy);
+          if (yint == 0)
+            return __builtin_nan ("");
+          if (yint == 1)
+            sign_bias = SignBias;
+          ix &= 0x7fffffffffffffff;
+          topx &= 0x7ff;
+        }
+      /* Note: if |y| > 1075 * ln2 * 2^53 ~= 0x1.749p62 then pow(x,y) = inf/0
+         and if |y| < 2^-54 / 1075 ~= 0x1.e7b6p-65 then pow(x,y) = +-1.  */
+      if ((topy & 0x7ff) - SmallPowY >= ThresPowY)
+        {
+          /* Note: sign_bias == 0 here because y is not odd.  */
+          if (ix == asuint64 (1.0))
+            return 1.0;
+          /* |y| < 2^-65, x^y ~= 1 + y*log(x).  */
+          if ((topy & 0x7ff) < SmallPowY)
+            return 1.0;
+          return (ix > asuint64 (1.0)) == (topy < 0x800) ? INFINITY : 0;
+        }
+      if (topx == 0)
+        {
+          /* Normalize subnormal x so exponent becomes negative.  */
+          ix = asuint64 (x * 0x1p52);
+          ix &= 0x7fffffffffffffff;
+          ix -= 52ULL << 52;
+        }
+    }
 
-/* Compute y+TAIL = log(x) where the rounded result is y and TAIL has about
-   additional 15 bits precision.  IX is the bit representation of x, but
-   normalized in the subnormal range using the sign bit for the exponent.  */
-static inline float64x2_t
-v_log_inline (uint64x2_t ix, float64x2_t *tail, const struct data *d)
-{
-  /* x = 2^k z; where z is in range [OFF,2*OFF) and exact.
-     The range is split into N subintervals.
-     The ith subinterval contains z and c is near its center.  */
-  uint64x2_t tmp = vsubq_u64 (ix, d->offset);
-  int64x2_t k = vshrq_n_s64 (vreinterpretq_s64_u64 (tmp), 52);
-  uint64x2_t iz = vsubq_u64 (ix, vandq_u64 (tmp, d->mask));
-  float64x2_t z = vreinterpretq_f64_u64 (iz);
-  float64x2_t kd = vcvtq_f64_s64 (k);
-  /* log(x) = k*Ln2 + log(c) + log1p(z/c-1).  */
-  float64x2_t invc = v_masked_lookup_f64 (__v_pow_log_data.invc, tmp);
-  float64x2_t logc = v_masked_lookup_f64 (__v_pow_log_data.logc, tmp);
-  float64x2_t logctail = v_masked_lookup_f64 (__v_pow_log_data.logctail, tmp);
-  /* Note: 1/c is j/N or j/N/2 where j is an integer in [N,2N) and
-     |z/c - 1| < 1/N, so r = z/c - 1 is exactly representible.  */
-  float64x2_t r = vfmaq_f64 (v_f64 (-1.0), z, invc);
-  /* k*Ln2 + log(c) + r.  */
-  float64x2_t ln2 = vld1q_f64 (&d->ln2_lo);
-  float64x2_t t1 = vfmaq_laneq_f64 (logc, kd, ln2, 1);
-  float64x2_t t2 = vaddq_f64 (t1, r);
-  float64x2_t lo1 = vfmaq_laneq_f64 (logctail, kd, ln2, 0);
-  float64x2_t lo2 = vaddq_f64 (vsubq_f64 (t1, t2), r);
-  /* Evaluation is optimized assuming superscalar pipelined execution.  */
-  float64x2_t ar = vmulq_f64 (v_f64 (-0.5), r);
-  float64x2_t ar2 = vmulq_f64 (r, ar);
-  float64x2_t ar3 = vmulq_f64 (r, ar2);
-  /* k*Ln2 + log(c) + r + A[0]*r*r.  */
-  float64x2_t hi = vaddq_f64 (t2, ar2);
-  float64x2_t lo3 = vfmaq_f64 (vnegq_f64 (ar2), ar, r);
-  float64x2_t lo4 = vaddq_f64 (vsubq_f64 (t2, hi), ar2);
-  /* p = log1p(r) - r - A[0]*r*r.  */
-  float64x2_t odd_coeffs = vld1q_f64 (&d->log_c1);
-  float64x2_t a56 = vfmaq_f64 (d->log_c4, r, d->log_c5);
-  float64x2_t a34 = vfmaq_laneq_f64 (d->log_c2, r, odd_coeffs, 1);
-  float64x2_t a12 = vfmaq_laneq_f64 (d->log_c0, r, odd_coeffs, 0);
-  float64x2_t p = vfmaq_f64 (a34, ar2, a56);
-  p = vfmaq_f64 (a12, ar2, p);
-  p = vmulq_f64 (ar3, p);
-  float64x2_t lo
-      = vaddq_f64 (vaddq_f64 (vaddq_f64 (vaddq_f64 (lo1, lo2), lo3), lo4), p);
-  float64x2_t y = vaddq_f64 (hi, lo);
-  *tail = vaddq_f64 (vsubq_f64 (hi, y), lo);
-  return y;
+  /* Core computation of exp (y * log (x)).  */
+  double lo;
+  double hi = log_inline (ix, &lo);
+  double ehi = y * hi;
+  double elo = y * lo + fma (y, hi, -ehi);
+  return exp_inline (ehi, elo, sign_bias);
 }
 
 static float64x2_t VPCS_ATTR NOINLINE
-exp_special_case (float64x2_t x, float64x2_t xtail)
-{
-  return (float64x2_t){ exp_nosignbias (x[0], xtail[0]),
-                        exp_nosignbias (x[1], xtail[1]) };
-}
-
-/* Computes sign*exp(x+xtail) where |xtail| < 2^-8/N and |xtail| <= |x|.  */
-static inline float64x2_t
-v_exp_inline (float64x2_t x, float64x2_t neg_xtail, const struct data *d)
-{
-  /* Fallback to scalar exp_inline for all lanes if any lane
-     contains value of x s.t. |x| <= 2^-54 or >= 512.  */
-  uint64x2_t uoflowx = vcgeq_u64 (
-      vsubq_u64 (vreinterpretq_u64_f64 (vabsq_f64 (x)), d->small_exp),
-      d->thres_exp);
-  if (__glibc_unlikely (v_any_u64 (uoflowx)))
-    return exp_special_case (x, vnegq_f64 (neg_xtail));
-
-  /* exp(x) = 2^(k/N) * exp(r), with exp(r) in [2^(-1/2N),2^(1/2N)].  */
-  /* x = ln2/N*k + r, with k integer and r in [-ln2/2N, ln2/2N].  */
-  /* z - kd is in [-1, 1] in non-nearest rounding modes.  */
-  float64x2_t exp_consts = vld1q_f64 (&d->inv_ln2_n);
-  float64x2_t z = vmulq_laneq_f64 (x, exp_consts, 0);
-  float64x2_t kd = vrndnq_f64 (z);
-  uint64x2_t ki = vreinterpretq_u64_s64 (vcvtaq_s64_f64 (z));
-  float64x2_t ln2_n = vld1q_f64 (&d->ln2_lo_n);
-  float64x2_t r = vfmsq_laneq_f64 (x, kd, ln2_n, 1);
-  r = vfmsq_laneq_f64 (r, kd, ln2_n, 0);
-  /* The code assumes 2^-200 < |xtail| < 2^-8/N.  */
-  r = vsubq_f64 (r, neg_xtail);
-  /* 2^(k/N) ~= scale.  */
-  uint64x2_t idx = vandq_u64 (ki, v_u64 (N_EXP - 1));
-  uint64x2_t top = vshlq_n_u64 (ki, 52 - V_POW_EXP_TABLE_BITS);
-  /* This is only a valid scale when -1023*N < k < 1024*N.  */
-  uint64x2_t sbits = v_lookup_u64 (SBits, idx);
-  sbits = vaddq_u64 (sbits, top);
-  /* exp(x) = 2^(k/N) * exp(r) ~= scale + scale * (exp(r) - 1).  */
-  float64x2_t r2 = vmulq_f64 (r, r);
-  float64x2_t tmp = vfmaq_laneq_f64 (d->exp_c1, r, exp_consts, 1);
-  tmp = vfmaq_f64 (d->exp_c0, r, tmp);
-  tmp = vfmaq_f64 (r, r2, tmp);
-  float64x2_t scale = vreinterpretq_f64_u64 (sbits);
-  /* Note: tmp == 0 or |tmp| > 2^-200 and scale > 2^-739, so there
-     is no spurious underflow here even without fma.  */
-  return vfmaq_f64 (scale, scale, tmp);
-}
-
-static float64x2_t NOINLINE VPCS_ATTR
 scalar_fallback (float64x2_t x, float64x2_t y)
 {
   return (float64x2_t){ pow_scalar_special_case (x[0], y[0]),
                         pow_scalar_special_case (x[1], y[1]) };
 }
 
+/* Implementation of AdvSIMD pow.
+   Maximum measured error is 1.04 ULPs:
+   _ZGVnN2vv_pow(0x1.024a3e56b3c3p-136, 0x1.87910248b58acp-13)
+   got 0x1.f71162f473251p-1
+   want 0x1.f71162f473252p-1.  */
 float64x2_t VPCS_ATTR V_NAME_D2 (pow) (float64x2_t x, float64x2_t y)
 {
   const struct data *d = ptr_barrier (&data);
+
   /* Case of x <= 0 is too complicated to be vectorised efficiently here,
      fallback to scalar pow for all lanes if any x < 0 detected.  */
   if (v_any_u64 (vclezq_s64 (vreinterpretq_s64_f64 (x))))
@@ -206,43 +129,30 @@ float64x2_t VPCS_ATTR V_NAME_D2 (pow) (float64x2_t x, float64x2_t y)
 
   uint64x2_t vix = vreinterpretq_u64_f64 (x);
   uint64x2_t viy = vreinterpretq_u64_f64 (y);
-  uint64x2_t iay = vandq_u64 (viy, d->inf);
 
   /* Special cases of x or y.
     The case y==0 does not trigger a special case, since in this case it is
     necessary to fix the result only if x is a signalling nan, which already
    triggers a special case.  We test y==0 directly in the scalar fallback.  */
-  uint64x2_t iax = vandq_u64 (vix, d->inf);
-  uint64x2_t specialx = vcgeq_u64 (iax, d->inf);
-  uint64x2_t specialy = vcgeq_u64 (iay, d->inf);
-  uint64x2_t special = vorrq_u64 (specialx, specialy);
+  uint64x2_t x_is_inf_or_nan = vcgeq_u64 (vandq_u64 (vix, d->inf), d->inf);
+  uint64x2_t y_is_inf_or_nan = vcgeq_u64 (vandq_u64 (viy, d->inf), d->inf);
+  uint64x2_t special = vorrq_u64 (x_is_inf_or_nan, y_is_inf_or_nan);
+
+  /* Fallback to scalar on all lanes if any lane is inf or nan.  */
   if (__glibc_unlikely (v_any_u64 (special)))
     return scalar_fallback (x, y);
 
-  /* Small cases of x: |x| < 0x1p-126.  */
-  uint64x2_t smallx = vcaltq_f64 (x, d->small_powx);
-  if (__glibc_unlikely (v_any_u64 (smallx)))
+  /* Cases of subnormal x: |x| < 0x1p-1022.  */
+  uint64x2_t x_is_subnormal = vcaltq_f64 (x, d->subnormal_bound);
+  if (__glibc_unlikely (v_any_u64 (x_is_subnormal)))
     {
-      /* Update ix if top 12 bits of x are 0.  */
-      uint64x2_t sub_x = vceqzq_u64 (vshrq_n_u64 (vix, 52));
-      if (__glibc_unlikely (v_any_u64 (sub_x)))
-        {
-          /* Normalize subnormal x so exponent becomes negative.  */
-          uint64x2_t vix_norm = vreinterpretq_u64_f64 (
-              vabsq_f64 (vmulq_f64 (x, vcvtq_f64_u64 (d->mask_sub_0))));
-          vix_norm = vsubq_u64 (vix_norm, d->mask_sub_1);
-          vix = vbslq_u64 (sub_x, vix_norm, vix);
-        }
+      /* Normalize subnormal x so exponent becomes negative.  */
+      uint64x2_t vix_norm = vreinterpretq_u64_f64 (
          vabsq_f64 (vmulq_f64 (x, d->subnormal_scale)));
+      vix_norm = vsubq_u64 (vix_norm, d->subnormal_bias);
+      x = vbslq_f64 (x_is_subnormal, vreinterpretq_f64_u64 (vix_norm), x);
     }
 
-  /* Vector Log(ix, &lo).  */
-  float64x2_t vlo;
-  float64x2_t vhi = v_log_inline (vix, &vlo, d);
-
-  /* Vector Exp(y_loghi, y_loglo).  */
-  float64x2_t vehi = vmulq_f64 (y, vhi);
-  float64x2_t vemi = vfmsq_f64 (vehi, y, vhi);
-  float64x2_t neg_velo = vfmsq_f64 (vemi, y, vlo);
-  return v_exp_inline (vehi, neg_velo, d);
+  /* Core computation of exp (y * log (x)).  */
+  return v_pow_inline (x, y, d);
 }
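As a side note on the scalar tail above: log(x) is carried as a hi/lo
pair, and fma recovers the rounding error of y * hi exactly, so y * log(x)
reaches exp_inline with close to double-double accuracy. A standalone
sketch (not part of the patch; hi/lo values are made up for illustration):

#include <math.h>
#include <stdio.h>

int
main (void)
{
  double y = 0x1.87910248b58acp-13; /* Example exponent.  */
  double hi = -0x1.7654ba014p+6;    /* Pretend log(x) hi part.  */
  double lo = 0x1.23p-48;           /* Pretend log(x) lo part.  */

  double ehi = y * hi;
  /* err = y*hi - round(y*hi), computed exactly by the fused
     multiply-add.  */
  double err = fma (y, hi, -ehi);
  double elo = y * lo + err;

  printf ("ehi = %a\nelo = %a\n", ehi, elo);
  return 0;
}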
diff --git a/sysdeps/aarch64/fpu/pow_sve.c b/sysdeps/aarch64/fpu/pow_sve.c
index e09a52f115..59154d63ad 100644
--- a/sysdeps/aarch64/fpu/pow_sve.c
+++ b/sysdeps/aarch64/fpu/pow_sve.c
@@ -42,295 +42,8 @@
 #include "math_config.h"
 #include "sv_math.h"
 
-/* Data is defined in v_pow_log_data.c.  */
-#define N_LOG (1 << V_POW_LOG_TABLE_BITS)
-#define Off 0x3fe6955500000000
-
-/* Data is defined in v_pow_exp_data.c.  */
-#define N_EXP (1 << V_POW_EXP_TABLE_BITS)
-#define SignBias (0x800 << V_POW_EXP_TABLE_BITS)
-#define SmallExp 0x3c9 /* top12(0x1p-54).  */
-#define BigExp 0x408 /* top12(512.).  */
-#define ThresExp 0x03f /* BigExp - SmallExp.  */
-#define HugeExp 0x409 /* top12(1024.).  */
-
-/* Constants associated with pow.  */
-#define SmallBoundX 0x1p-126
-#define SmallPowX 0x001 /* top12(0x1p-126).  */
-#define BigPowX 0x7ff /* top12(INFINITY).  */
-#define ThresPowX 0x7fe /* BigPowX - SmallPowX.  */
-#define SmallPowY 0x3be /* top12(0x1.e7b6p-65).  */
-#define BigPowY 0x43e /* top12(0x1.749p62).  */
-#define ThresPowY 0x080 /* BigPowY - SmallPowY.  */
-
-static const struct data
-{
-  double log_c0, log_c2, log_c4, log_c6, ln2_hi, ln2_lo;
-  double log_c1, log_c3, log_c5, off;
-  double n_over_ln2, exp_c2, ln2_over_n_hi, ln2_over_n_lo;
-  double exp_c0, exp_c1;
-} data = {
-  .log_c0 = -0x1p-1,
-  .log_c1 = -0x1.555555555556p-1,
-  .log_c2 = 0x1.0000000000006p-1,
-  .log_c3 = 0x1.999999959554ep-1,
-  .log_c4 = -0x1.555555529a47ap-1,
-  .log_c5 = -0x1.2495b9b4845e9p0,
-  .log_c6 = 0x1.0002b8b263fc3p0,
-  .off = Off,
-  .exp_c0 = 0x1.fffffffffffd4p-2,
-  .exp_c1 = 0x1.5555571d6ef9p-3,
-  .exp_c2 = 0x1.5555576a5adcep-5,
-  .ln2_hi = 0x1.62e42fefa3800p-1,
-  .ln2_lo = 0x1.ef35793c76730p-45,
-  .n_over_ln2 = 0x1.71547652b82fep0 * N_EXP,
-  .ln2_over_n_hi = 0x1.62e42fefc0000p-9,
-  .ln2_over_n_lo = -0x1.c610ca86c3899p-45,
-};
-
-/* Check if x is an integer.  */
-static inline svbool_t
-sv_isint (svbool_t pg, svfloat64_t x)
-{
-  return svcmpeq (pg, svrintz_z (pg, x), x);
-}
-
-/* Check if x is real not integer valued.  */
-static inline svbool_t
-sv_isnotint (svbool_t pg, svfloat64_t x)
-{
-  return svcmpne (pg, svrintz_z (pg, x), x);
-}
-
-/* Check if x is an odd integer.  */
-static inline svbool_t
-sv_isodd (svbool_t pg, svfloat64_t x)
-{
-  svfloat64_t y = svmul_x (svptrue_b64 (), x, 0.5);
-  return sv_isnotint (pg, y);
-}
-
-/* Returns 0 if not int, 1 if odd int, 2 if even int.  The argument is
-   the bit representation of a non-zero finite floating-point value.  */
-static inline int
-checkint (uint64_t iy)
-{
-  int e = iy >> 52 & 0x7ff;
-  if (e < 0x3ff)
-    return 0;
-  if (e > 0x3ff + 52)
-    return 2;
-  if (iy & ((1ULL << (0x3ff + 52 - e)) - 1))
-    return 0;
-  if (iy & (1ULL << (0x3ff + 52 - e)))
-    return 1;
-  return 2;
-}
-
-/* Top 12 bits (sign and exponent of each double float lane).  */
-static inline svuint64_t
-sv_top12 (svfloat64_t x)
-{
-  return svlsr_x (svptrue_b64 (), svreinterpret_u64 (x), 52);
-}
-
-/* Returns 1 if input is the bit representation of 0, infinity or nan.  */
-static inline int
-zeroinfnan (uint64_t i)
-{
-  return 2 * i - 1 >= 2 * asuint64 (INFINITY) - 1;
-}
-
-/* Returns 1 if input is the bit representation of 0, infinity or nan.  */
-static inline svbool_t
-sv_zeroinfnan (svbool_t pg, svuint64_t i)
-{
-  return svcmpge (pg, svsub_x (pg, svadd_x (pg, i, i), 1),
-                  2 * asuint64 (INFINITY) - 1);
-}
-
-/* Handle cases that may overflow or underflow when computing the result that
-   is scale*(1+TMP) without intermediate rounding.  The bit representation of
-   scale is in SBITS, however it has a computed exponent that may have
-   overflown into the sign bit so that needs to be adjusted before using it as
-   a double.  (int32_t)KI is the k used in the argument reduction and exponent
-   adjustment of scale, positive k here means the result may overflow and
-   negative k means the result may underflow.  */
-static inline svfloat64_t
-specialcase (svfloat64_t tmp, svuint64_t sbits, svuint64_t ki, svbool_t cmp)
-{
-  svbool_t p_pos = svcmpge_n_f64 (cmp, svreinterpret_f64_u64 (ki), 0.0);
-
-  /* Scale up or down depending on sign of k.  */
-  svint64_t offset
-      = svsel_s64 (p_pos, sv_s64 (1009ull << 52), sv_s64 (-1022ull << 52));
-  svfloat64_t factor
-      = svsel_f64 (p_pos, sv_f64 (0x1p1009), sv_f64 (0x1p-1022));
-
-  svuint64_t offset_sbits
-      = svsub_u64_x (cmp, sbits, svreinterpret_u64_s64 (offset));
-  svfloat64_t scale = svreinterpret_f64_u64 (offset_sbits);
-  svfloat64_t res = svmad_f64_x (cmp, scale, tmp, scale);
-  return svmul_f64_x (cmp, res, factor);
-}
-
-/* Compute y+TAIL = log(x) where the rounded result is y and TAIL has about
-   additional 15 bits precision.  IX is the bit representation of x, but
-   normalized in the subnormal range using the sign bit for the exponent.  */
-static inline svfloat64_t
-sv_log_inline (svbool_t pg, svuint64_t ix, svfloat64_t *tail,
-               const struct data *d)
-{
-  /* x = 2^k z; where z is in range [Off,2*Off) and exact.
-     The range is split into N subintervals.
-     The ith subinterval contains z and c is near its center.  */
-  svuint64_t tmp = svsub_x (pg, ix, d->off);
-  svuint64_t i = svand_x (pg, svlsr_x (pg, tmp, 52 - V_POW_LOG_TABLE_BITS),
-                          sv_u64 (N_LOG - 1));
-  svint64_t k = svasr_x (pg, svreinterpret_s64 (tmp), 52);
-  svuint64_t iz = svsub_x (pg, ix, svlsl_x (pg, svreinterpret_u64 (k), 52));
-  svfloat64_t z = svreinterpret_f64 (iz);
-  svfloat64_t kd = svcvt_f64_x (pg, k);
-
-  /* log(x) = k*Ln2 + log(c) + log1p(z/c-1).  */
-  /* SVE lookup requires 3 separate lookup tables, as opposed to scalar version
-     that uses array of structures.  We also do the lookup earlier in the code
-     to make sure it finishes as early as possible.  */
-  svfloat64_t invc = svld1_gather_index (pg, __v_pow_log_data.invc, i);
-  svfloat64_t logc = svld1_gather_index (pg, __v_pow_log_data.logc, i);
-  svfloat64_t logctail = svld1_gather_index (pg, __v_pow_log_data.logctail, i);
-
-  /* Note: 1/c is j/N or j/N/2 where j is an integer in [N,2N) and
-     |z/c - 1| < 1/N, so r = z/c - 1 is exactly representible.  */
-  svfloat64_t r = svmad_x (pg, z, invc, -1.0);
-  /* k*Ln2 + log(c) + r.  */
-
-  svfloat64_t ln2_hilo = svld1rq_f64 (svptrue_b64 (), &d->ln2_hi);
-  svfloat64_t t1 = svmla_lane_f64 (logc, kd, ln2_hilo, 0);
-  svfloat64_t t2 = svadd_x (pg, t1, r);
-  svfloat64_t lo1 = svmla_lane_f64 (logctail, kd, ln2_hilo, 1);
-  svfloat64_t lo2 = svadd_x (pg, svsub_x (pg, t1, t2), r);
-
-  /* Evaluation is optimized assuming superscalar pipelined execution.  */
-
-  svfloat64_t log_c02 = svld1rq_f64 (svptrue_b64 (), &d->log_c0);
-  svfloat64_t ar = svmul_lane_f64 (r, log_c02, 0);
-  svfloat64_t ar2 = svmul_x (svptrue_b64 (), r, ar);
-  svfloat64_t ar3 = svmul_x (svptrue_b64 (), r, ar2);
-  /* k*Ln2 + log(c) + r + A[0]*r*r.  */
-  svfloat64_t hi = svadd_x (pg, t2, ar2);
-  svfloat64_t lo3 = svmls_x (pg, ar2, ar, r);
-  svfloat64_t lo4 = svadd_x (pg, svsub_x (pg, t2, hi), ar2);
-  /* p = log1p(r) - r - A[0]*r*r.  */
-  /* p = (ar3 * (A[1] + r * A[2] + ar2 * (A[3] + r * A[4] + ar2 * (A[5] + r *
-     A[6])))).  */
-
-  svfloat64_t log_c46 = svld1rq_f64 (svptrue_b64 (), &d->log_c4);
-  svfloat64_t a56 = svmla_lane_f64 (sv_f64 (d->log_c5), r, log_c46, 1);
-  svfloat64_t a34 = svmla_lane_f64 (sv_f64 (d->log_c3), r, log_c46, 0);
-  svfloat64_t a12 = svmla_lane_f64 (sv_f64 (d->log_c1), r, log_c02, 1);
-  svfloat64_t p = svmla_x (pg, a34, ar2, a56);
-  p = svmla_x (pg, a12, ar2, p);
-  p = svmul_x (svptrue_b64 (), ar3, p);
-  svfloat64_t lo = svadd_x (
-      pg, svadd_x (pg, svsub_x (pg, svadd_x (pg, lo1, lo2), lo3), lo4), p);
-  svfloat64_t y = svadd_x (pg, hi, lo);
-  *tail = svadd_x (pg, svsub_x (pg, hi, y), lo);
-  return y;
-}
-
-static inline svfloat64_t
-sv_exp_core (svbool_t pg, svfloat64_t x, svfloat64_t xtail,
-             svuint64_t sign_bias, svfloat64_t *tmp, svuint64_t *sbits,
-             svuint64_t *ki, const struct data *d)
-{
-  /* exp(x) = 2^(k/N) * exp(r), with exp(r) in [2^(-1/2N),2^(1/2N)].  */
-  /* x = ln2/N*k + r, with int k and r in [-ln2/2N, ln2/2N].  */
-  svfloat64_t n_over_ln2_and_c2 = svld1rq_f64 (svptrue_b64 (), &d->n_over_ln2);
-  svfloat64_t z = svmul_lane_f64 (x, n_over_ln2_and_c2, 0);
-  /* z - kd is in [-1, 1] in non-nearest rounding modes.  */
-  svfloat64_t kd = svrinta_x (pg, z);
-  *ki = svreinterpret_u64 (svcvt_s64_x (pg, kd));
-
-  svfloat64_t ln2_over_n_hilo
-      = svld1rq_f64 (svptrue_b64 (), &d->ln2_over_n_hi);
-  svfloat64_t r = x;
-  r = svmls_lane_f64 (r, kd, ln2_over_n_hilo, 0);
-  r = svmls_lane_f64 (r, kd, ln2_over_n_hilo, 1);
-  /* The code assumes 2^-200 < |xtail| < 2^-8/N.  */
-  r = svadd_x (pg, r, xtail);
-  /* 2^(k/N) ~= scale.  */
-  svuint64_t idx = svand_x (pg, *ki, N_EXP - 1);
-  svuint64_t top
-      = svlsl_x (pg, svadd_x (pg, *ki, sign_bias), 52 - V_POW_EXP_TABLE_BITS);
-  /* This is only a valid scale when -1023*N < k < 1024*N.  */
-  *sbits = svld1_gather_index (pg, __v_pow_exp_data.sbits, idx);
-  *sbits = svadd_x (pg, *sbits, top);
-  /* exp(x) = 2^(k/N) * exp(r) ~= scale + scale * (exp(r) - 1).  */
-  svfloat64_t r2 = svmul_x (svptrue_b64 (), r, r);
-  *tmp = svmla_lane_f64 (sv_f64 (d->exp_c1), r, n_over_ln2_and_c2, 1);
-  *tmp = svmla_x (pg, sv_f64 (d->exp_c0), r, *tmp);
-  *tmp = svmla_x (pg, r, r2, *tmp);
-  svfloat64_t scale = svreinterpret_f64 (*sbits);
-  /* Note: tmp == 0 or |tmp| > 2^-200 and scale > 2^-739, so there
-     is no spurious underflow here even without fma.  */
-  z = svmla_x (pg, scale, scale, *tmp);
-  return z;
-}
-
-/* Computes sign*exp(x+xtail) where |xtail| < 2^-8/N and |xtail| <= |x|.
-   The sign_bias argument is SignBias or 0 and sets the sign to -1 or 1.  */
-static inline svfloat64_t
-sv_exp_inline (svbool_t pg, svfloat64_t x, svfloat64_t xtail,
-               svuint64_t sign_bias, const struct data *d)
-{
-  /* 3 types of special cases: tiny (uflow and spurious uflow), huge (oflow)
-     and other cases of large values of x (scale * (1 + TMP) oflow).  */
-  svuint64_t abstop = svand_x (pg, sv_top12 (x), 0x7ff);
-  /* |x| is large (|x| >= 512) or tiny (|x| <= 0x1p-54).  */
-  svbool_t uoflow = svcmpge (pg, svsub_x (pg, abstop, SmallExp), ThresExp);
-
-  svfloat64_t tmp;
-  svuint64_t sbits, ki;
-  if (__glibc_unlikely (svptest_any (pg, uoflow)))
-    {
-      svfloat64_t z
-          = sv_exp_core (pg, x, xtail, sign_bias, &tmp, &sbits, &ki, d);
-
-      /* |x| is tiny (|x| <= 0x1p-54).  */
-      svbool_t uflow
-          = svcmpge (pg, svsub_x (pg, abstop, SmallExp), 0x80000000);
-      uflow = svand_z (pg, uoflow, uflow);
-      /* |x| is huge (|x| >= 1024).  */
-      svbool_t oflow = svcmpge (pg, abstop, HugeExp);
-      oflow = svand_z (pg, uoflow, svbic_z (pg, oflow, uflow));
-
-      /* Handle underflow and overlow in scale.
-         For large |x| values (512 < |x| < 1024), scale * (1 + TMP) can
-         overflow or underflow.  */
-      svbool_t special = svbic_z (pg, uoflow, svorr_z (pg, uflow, oflow));
-      if (__glibc_unlikely (svptest_any (pg, special)))
-        z = svsel (special, specialcase (tmp, sbits, ki, special), z);
-
-      /* Handle underflow and overflow in exp.  */
-      svbool_t x_is_neg = svcmplt (pg, x, 0);
-      svuint64_t sign_mask
-          = svlsl_x (pg, sign_bias, 52 - V_POW_EXP_TABLE_BITS);
-      svfloat64_t res_uoflow
-          = svsel (x_is_neg, sv_f64 (0.0), sv_f64 (INFINITY));
-      res_uoflow = svreinterpret_f64 (
-          svorr_x (pg, svreinterpret_u64 (res_uoflow), sign_mask));
-      /* Avoid spurious underflow for tiny x.  */
-      svfloat64_t res_spurious_uflow
-          = svreinterpret_f64 (svorr_x (pg, sign_mask, 0x3ff0000000000000));
-
-      z = svsel (oflow, res_uoflow, z);
-      z = svsel (uflow, res_spurious_uflow, z);
-      return z;
-    }
-
-  return sv_exp_core (pg, x, xtail, sign_bias, &tmp, &sbits, &ki, d);
-}
+#define WANT_SV_POW_SIGN_BIAS 1
+#include "sv_pow_inline.h"
 
 static inline double
 pow_specialcase (double x, double y)
@@ -398,18 +111,14 @@ svfloat64_t SV_NAME_D2 (pow) (svfloat64_t x, svfloat64_t y, const svbool_t pg)
       sign_bias = svsel (yisodd_xisneg, sv_u64 (SignBias), sv_u64 (0));
     }
 
-  /* Small cases of x: |x| < 0x1p-126.  */
-  svbool_t xsmall = svaclt (yint_or_xpos, x, SmallBoundX);
-  if (__glibc_unlikely (svptest_any (yint_or_xpos, xsmall)))
+  /* Cases of subnormal x: |x| < 0x1p-1022.  */
+  svbool_t x_is_subnormal = svaclt (yint_or_xpos, x, 0x1p-1022);
+  if (__glibc_unlikely (svptest_any (yint_or_xpos, x_is_subnormal)))
    {
      /* Normalize subnormal x so exponent becomes negative.  */
-      svuint64_t vtopx = svlsr_x (svptrue_b64 (), vix, 52);
-      svbool_t topx_is_null = svcmpeq (xsmall, vtopx, 0);
-
-      svuint64_t vix_norm = svreinterpret_u64 (svmul_m (xsmall, x, 0x1p52));
-      vix_norm = svand_m (xsmall, vix_norm, 0x7fffffffffffffff);
-      vix_norm = svsub_m (xsmall, vix_norm, 52ULL << 52);
-      vix = svsel (topx_is_null, vix_norm, vix);
+      vix = svreinterpret_u64 (svmul_m (x_is_subnormal, x, 0x1p52));
+      vix = svand_m (x_is_subnormal, vix, 0x7fffffffffffffff);
+      vix = svsub_m (x_is_subnormal, vix, 52ULL << 52);
    }
 
   /* y_hi = log(ix, &y_lo).  */
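On the sign_bias selection that the new config parameter controls: for
finite x < 0, pow is real only when y is an integer, and the result is
negative exactly when y is odd. The vector paths test oddness as "y is an
integer but y/2 is not". A scalar model of that rule (not part of the
patch; it is valid while y and y/2 are represented exactly, which the
bit-level checkint guarantees in the real code):

#include <assert.h>
#include <math.h>

/* 0 = not an integer, 1 = odd integer, 2 = even integer.  */
static int
checkint_model (double y)
{
  if (trunc (y) != y)
    return 0;
  return trunc (y * 0.5) == y * 0.5 ? 2 : 1;
}

int
main (void)
{
  assert (checkint_model (3.0) == 1); /* pow (-2, 3) = -8: bias set.  */
  assert (checkint_model (4.0) == 2); /* pow (-2, 4) = 16: no bias.  */
  assert (checkint_model (0.5) == 0); /* pow (-2, 0.5) is NaN.  */
  return 0;
}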
diff --git a/sysdeps/aarch64/fpu/powf_advsimd.c b/sysdeps/aarch64/fpu/powf_advsimd.c
index 5a4626b58e..81a8bcaaf9 100644
--- a/sysdeps/aarch64/fpu/powf_advsimd.c
+++ b/sysdeps/aarch64/fpu/powf_advsimd.c
@@ -17,194 +17,148 @@
    License along with the GNU C Library; if not, see
    <https://www.gnu.org/licenses/>.  */
 
-#include "math_config.h"
+#include "flt-32/math_config.h"
 #include "v_math.h"
+#include "v_powf_inline.h"
 
-#define Min v_u32 (0x00800000)
-#define Max v_u32 (0x7f800000)
-#define Thresh v_u32 (0x7f000000) /* Max - Min.  */
-#define MantissaMask v_u32 (0x007fffff)
-
-#define A d->log2_poly
-#define C d->exp2f_poly
-
-/* 2.6 ulp ~ 0.5 + 2^24 (128*Ln2*relerr_log2 + relerr_exp2).  */
-#define Off v_u32 (0x3f35d000)
-
-#define V_POWF_LOG2_TABLE_BITS 5
-#define V_EXP2F_TABLE_BITS 5
-#define Log2IdxMask ((1 << V_POWF_LOG2_TABLE_BITS) - 1)
-#define Scale ((double) (1 << V_EXP2F_TABLE_BITS))
-
-static const struct data
+/* Check if x is an integer.  */
+static inline uint32x4_t
+visint (float32x4_t x)
 {
-  struct
-  {
-    double invc, logc;
-  } log2_tab[1 << V_POWF_LOG2_TABLE_BITS];
-  float64x2_t log2_poly[4];
-  uint64_t exp2f_tab[1 << V_EXP2F_TABLE_BITS];
-  float64x2_t exp2f_poly[3];
-} data = {
-  .log2_tab = {{0x1.6489890582816p+0, -0x1.e960f97b22702p-2 * Scale},
-               {0x1.5cf19b35e3472p+0, -0x1.c993406cd4db6p-2 * Scale},
-               {0x1.55aac0e956d65p+0, -0x1.aa711d9a7d0f3p-2 * Scale},
-               {0x1.4eb0022977e01p+0, -0x1.8bf37bacdce9bp-2 * Scale},
-               {0x1.47fcccda1dd1fp+0, -0x1.6e13b3519946ep-2 * Scale},
-               {0x1.418ceabab68c1p+0, -0x1.50cb8281e4089p-2 * Scale},
-               {0x1.3b5c788f1edb3p+0, -0x1.341504a237e2bp-2 * Scale},
-               {0x1.3567de48e9c9ap+0, -0x1.17eaab624ffbbp-2 * Scale},
-               {0x1.2fabc80fd19bap+0, -0x1.f88e708f8c853p-3 * Scale},
-               {0x1.2a25200ce536bp+0, -0x1.c24b6da113914p-3 * Scale},
-               {0x1.24d108e0152e3p+0, -0x1.8d02ee397cb1dp-3 * Scale},
-               {0x1.1facd8ab2fbe1p+0, -0x1.58ac1223408b3p-3 * Scale},
-               {0x1.1ab614a03efdfp+0, -0x1.253e6fd190e89p-3 * Scale},
-               {0x1.15ea6d03af9ffp+0, -0x1.e5641882c12ffp-4 * Scale},
-               {0x1.1147b994bb776p+0, -0x1.81fea712926f7p-4 * Scale},
-               {0x1.0ccbf650593aap+0, -0x1.203e240de64a3p-4 * Scale},
-               {0x1.0875408477302p+0, -0x1.8029b86a78281p-5 * Scale},
-               {0x1.0441d42a93328p+0, -0x1.85d713190fb9p-6 * Scale},
-               {0x1p+0, 0x0p+0 * Scale},
-               {0x1.f1d006c855e86p-1, 0x1.4c1cc07312997p-5 * Scale},
-               {0x1.e28c3341aa301p-1, 0x1.5e1848ccec948p-4 * Scale},
-               {0x1.d4bdf9aa64747p-1, 0x1.04cfcb7f1196fp-3 * Scale},
-               {0x1.c7b45a24e5803p-1, 0x1.582813d463c21p-3 * Scale},
-               {0x1.bb5f5eb2ed60ap-1, 0x1.a936fa68760ccp-3 * Scale},
-               {0x1.afb0bff8fe6b4p-1, 0x1.f81bc31d6cc4ep-3 * Scale},
-               {0x1.a49badf7ab1f5p-1, 0x1.2279a09fae6b1p-2 * Scale},
-               {0x1.9a14a111fc4c9p-1, 0x1.47ec0b6df5526p-2 * Scale},
-               {0x1.901131f5b2fdcp-1, 0x1.6c71762280f1p-2 * Scale},
-               {0x1.8687f73f6d865p-1, 0x1.90155070798dap-2 * Scale},
-               {0x1.7d7067eb77986p-1, 0x1.b2e23b1d3068cp-2 * Scale},
-               {0x1.74c2c1cf97b65p-1, 0x1.d4e21b0daa86ap-2 * Scale},
-               {0x1.6c77f37cff2a1p-1, 0x1.f61e2a2f67f3fp-2 * Scale},},
-  .log2_poly = { /* rel err: 1.5 * 2^-30.  */
-                 V2 (-0x1.6ff5daa3b3d7cp-2 * Scale),
-                 V2 (0x1.ec81d03c01aebp-2 * Scale),
-                 V2 (-0x1.71547bb43f101p-1 * Scale),
-                 V2 (0x1.7154764a815cbp0 * Scale)},
-  .exp2f_tab = {0x3ff0000000000000, 0x3fefd9b0d3158574, 0x3fefb5586cf9890f,
-                0x3fef9301d0125b51, 0x3fef72b83c7d517b, 0x3fef54873168b9aa,
-                0x3fef387a6e756238, 0x3fef1e9df51fdee1, 0x3fef06fe0a31b715,
-                0x3feef1a7373aa9cb, 0x3feedea64c123422, 0x3feece086061892d,
-                0x3feebfdad5362a27, 0x3feeb42b569d4f82, 0x3feeab07dd485429,
-                0x3feea47eb03a5585, 0x3feea09e667f3bcd, 0x3fee9f75e8ec5f74,
-                0x3feea11473eb0187, 0x3feea589994cce13, 0x3feeace5422aa0db,
-                0x3feeb737b0cdc5e5, 0x3feec49182a3f090, 0x3feed503b23e255d,
-                0x3feee89f995ad3ad, 0x3feeff76f2fb5e47, 0x3fef199bdd85529c,
-                0x3fef3720dcef9069, 0x3fef5818dcfba487, 0x3fef7c97337b9b5f,
-                0x3fefa4afa2a490da, 0x3fefd0765b6e4540,},
-  .exp2f_poly = { /* rel err: 1.69 * 2^-34.  */
-                  V2 (0x1.c6af84b912394p-5 / Scale / Scale / Scale),
-                  V2 (0x1.ebfce50fac4f3p-3 / Scale / Scale),
-                  V2 (0x1.62e42ff0c52d6p-1 / Scale)}};
+  return vceqq_f32 (vrndq_f32 (x), x);
+}
 
-static float32x4_t VPCS_ATTR NOINLINE
-special_case (float32x4_t x, float32x4_t y, float32x4_t ret, uint32x4_t cmp)
+/* Check if x is real not integer valued.  */
+static inline uint32x4_t
+visnotint (float32x4_t x)
 {
-  return v_call2_f32 (powf, x, y, ret, cmp);
+  return vmvnq_u32 (vceqq_f32 (vrndq_f32 (x), x));
 }
 
-static inline float64x2_t
-ylogx_core (const struct data *d, float64x2_t iz, float64x2_t k,
-            float64x2_t invc, float64x2_t logc, float64x2_t y)
+/* Check if x is an odd integer.  */
+static inline uint32x4_t
+visodd (float32x4_t x)
 {
-
-  /* log2(x) = log1p(z/c-1)/ln2 + log2(c) + k.  */
-  float64x2_t r = vfmaq_f64 (v_f64 (-1.0), iz, invc);
-  float64x2_t y0 = vaddq_f64 (logc, k);
-
-  /* Polynomial to approximate log1p(r)/ln2.  */
-  float64x2_t logx = vfmaq_f64 (A[1], r, A[0]);
-  logx = vfmaq_f64 (A[2], logx, r);
-  logx = vfmaq_f64 (A[3], logx, r);
-  logx = vfmaq_f64 (y0, logx, r);
-
-  return vmulq_f64 (logx, y);
+  float32x4_t y = vmulq_n_f32 (x, 0.5f);
+  return visnotint (y);
 }
 
-static inline float64x2_t
-log2_lookup (const struct data *d, uint32_t i)
+/* Returns 0 if not int, 1 if odd int, 2 if even int.  The argument is
+   the bit representation of a non-zero finite floating-point value.  */
+static inline int
+checkint (uint32_t iy)
 {
-  return vld1q_f64 (
-      &d->log2_tab[(i >> (23 - V_POWF_LOG2_TABLE_BITS)) & Log2IdxMask].invc);
+  int e = iy >> 23 & 0xff;
+  if (e < 0x7f)
+    return 0;
+  if (e > 0x7f + 23)
+    return 2;
+  if (iy & ((1 << (0x7f + 23 - e)) - 1))
+    return 0;
+  if (iy & (1 << (0x7f + 23 - e)))
+    return 1;
+  return 2;
 }
 
-static inline uint64x1_t
-exp2f_lookup (const struct data *d, uint64_t i)
+/* A scalar subroutine used to fix main power special cases.  Similar to the
+   preamble of scalar powf except that we do not update ix and sign_bias.  This
+   is done in the preamble of the SVE powf.  */
+static inline float
+powf_specialcase (float x, float y)
 {
-  return vld1_u64 (&d->exp2f_tab[i % (1 << V_EXP2F_TABLE_BITS)]);
+  uint32_t ix = asuint (x);
+  uint32_t iy = asuint (y);
+  /* Either x or y is 0 or inf or nan.  */
+  if (__glibc_unlikely (zeroinfnan (iy)))
+    {
+      if (2 * iy == 0)
+        return issignalingf_inline (x) ? x + y : 1.0f;
+      if (ix == 0x3f800000)
+        return issignalingf_inline (y) ? x + y : 1.0f;
+      if (2 * ix > 2u * 0x7f800000 || 2 * iy > 2u * 0x7f800000)
+        return x + y;
+      if (2 * ix == 2 * 0x3f800000)
+        return 1.0f;
+      if ((2 * ix < 2 * 0x3f800000) == !(iy & 0x80000000))
+        return 0.0f; /* |x|<1 && y==inf or |x|>1 && y==-inf.  */
+      return y * y;
+    }
+  if (__glibc_unlikely (zeroinfnan (ix)))
+    {
+      float x2 = x * x;
+      if (ix & 0x80000000 && checkint (iy) == 1)
+        x2 = -x2;
+      return iy & 0x80000000 ? 1 / x2 : x2;
+    }
+  return x;
 }
 
-static inline float32x2_t
-powf_core (const struct data *d, float64x2_t ylogx)
+/* Special case function wrapper.  */
+static float32x4_t VPCS_ATTR NOINLINE
+special_case (float32x4_t x, float32x4_t y, float32x4_t ret, uint32x4_t cmp)
 {
-  /* N*x = k + r with r in [-1/2, 1/2].  */
-  float64x2_t kd = vrndnq_f64 (ylogx);
-  int64x2_t ki = vcvtaq_s64_f64 (ylogx);
-  float64x2_t r = vsubq_f64 (ylogx, kd);
-
-  /* exp2(x) = 2^(k/N) * 2^r ~= s * (C0*r^3 + C1*r^2 + C2*r + 1).  */
-  uint64x2_t t = vcombine_u64 (exp2f_lookup (d, vgetq_lane_s64 (ki, 0)),
-                               exp2f_lookup (d, vgetq_lane_s64 (ki, 1)));
-  t = vaddq_u64 (
-      t, vreinterpretq_u64_s64 (vshlq_n_s64 (ki, 52 - V_EXP2F_TABLE_BITS)));
-  float64x2_t s = vreinterpretq_f64_u64 (t);
-  float64x2_t p = vfmaq_f64 (C[1], r, C[0]);
-  p = vfmaq_f64 (C[2], r, p);
-  p = vfmaq_f64 (s, p, vmulq_f64 (s, r));
-  return vcvt_f32_f64 (p);
+  return v_call2_f32 (powf_specialcase, x, y, ret, cmp);
 }
 
-float32x4_t VPCS_ATTR V_NAME_F2 (pow) (float32x4_t x, float32x4_t y)
+/* Power implementation for x containing negative or subnormal lanes.  */
+static inline float32x4_t
+v_powf_x_is_neg_or_small (float32x4_t x, float32x4_t y, const struct data *d)
 {
-  const struct data *d = ptr_barrier (&data);
-  uint32x4_t u = vreinterpretq_u32_f32 (x);
-  uint32x4_t cmp = vcgeq_u32 (vsubq_u32 (u, Min), Thresh);
-  uint32x4_t tmp = vsubq_u32 (u, Off);
-  uint32x4_t top = vbicq_u32 (tmp, MantissaMask);
-  float32x4_t iz = vreinterpretq_f32_u32 (vsubq_u32 (u, top));
-  int32x4_t k = vshrq_n_s32 (vreinterpretq_s32_u32 (top),
-                             23 - V_EXP2F_TABLE_BITS); /* arithmetic shift.  */
-
-  /* Use double precision for each lane: split input vectors into lo and hi
-     halves and promote.  */
-  float64x2_t tab0 = log2_lookup (d, vgetq_lane_u32 (tmp, 0)),
-              tab1 = log2_lookup (d, vgetq_lane_u32 (tmp, 1)),
-              tab2 = log2_lookup (d, vgetq_lane_u32 (tmp, 2)),
-              tab3 = log2_lookup (d, vgetq_lane_u32 (tmp, 3));
-
-  float64x2_t iz_lo = vcvt_f64_f32 (vget_low_f32 (iz)),
-              iz_hi = vcvt_high_f64_f32 (iz);
+  uint32x4_t xisneg = vcltzq_f32 (x);
+  uint32x4_t xsmall = vcaltq_f32 (x, v_f32 (0x1p-126f));
 
-  float64x2_t k_lo = vcvtq_f64_s64 (vmovl_s32 (vget_low_s32 (k))),
-              k_hi = vcvtq_f64_s64 (vmovl_high_s32 (k));
+  /* Set sign_bias depending on sign of x and nature of y.  */
+  uint32x4_t yisodd_xisneg = vandq_u32 (visodd (y), xisneg);
 
-  float64x2_t invc_lo = vzip1q_f64 (tab0, tab1),
-              invc_hi = vzip1q_f64 (tab2, tab3),
-              logc_lo = vzip2q_f64 (tab0, tab1),
-              logc_hi = vzip2q_f64 (tab2, tab3);
+  /* Set variable to SignBias if x is negative and y is odd.  */
+  uint32x4_t sign_bias = vandq_u32 (d->sign_bias, yisodd_xisneg);
 
-  float64x2_t y_lo = vcvt_f64_f32 (vget_low_f32 (y)),
-              y_hi = vcvt_high_f64_f32 (y);
+  /* Normalize subnormals.  */
+  float32x4_t a = vabsq_f32 (x);
+  uint32x4_t ia_norm = vreinterpretq_u32_f32 (vmulq_f32 (a, d->norm));
+  ia_norm = vsubq_u32 (ia_norm, d->subnormal_bias);
+  a = vbslq_f32 (xsmall, vreinterpretq_f32_u32 (ia_norm), a);
 
-  float64x2_t ylogx_lo = ylogx_core (d, iz_lo, k_lo, invc_lo, logc_lo, y_lo);
-  float64x2_t ylogx_hi = ylogx_core (d, iz_hi, k_hi, invc_hi, logc_hi, y_hi);
+  /* Evaluate exp (y * log(x)) using |x| and sign bias correction.  */
+  float32x4_t ret = v_powf_core (a, y, sign_bias, d);
 
-  uint32x4_t ylogx_top = vuzp2q_u32 (vreinterpretq_u32_f64 (ylogx_lo),
-                                     vreinterpretq_u32_f64 (ylogx_hi));
-
-  cmp = vorrq_u32 (
-      cmp, vcgeq_u32 (vandq_u32 (vshrq_n_u32 (ylogx_top, 15), v_u32 (0xffff)),
-                      vdupq_n_u32 (asuint64 (126.0 * (1 << V_EXP2F_TABLE_BITS))
                                   >> 47)));
+  /* Cases of finite y and finite negative x.  */
+  uint32x4_t yint_or_xpos = vornq_u32 (visint (y), xisneg);
+  return vbslq_f32 (yint_or_xpos, ret, d->nan);
+}
 
-  float32x2_t p_lo = powf_core (d, ylogx_lo);
-  float32x2_t p_hi = powf_core (d, ylogx_hi);
+/* Implementation of AdvSIMD powf.
+   The theoretical maximum error is under 2.60 ULPs.
+   Maximum measured error is 2.57 ULPs:
+   V_NAME_F2 (pow) (0x1.031706p+0, 0x1.ce2ec2p+12)
+   got 0x1.fff868p+127
+   want 0x1.fff862p+127.  */
+float32x4_t VPCS_ATTR NOINLINE V_NAME_F2 (pow) (float32x4_t x, float32x4_t y)
+{
+  const struct data *d = ptr_barrier (&data);
+  /* Special cases of x or y: zero, inf and nan.  */
+  uint32x4_t ix = vreinterpretq_u32_f32 (x);
+  uint32x4_t iy = vreinterpretq_u32_f32 (y);
+  uint32x4_t xspecial = v_zeroinfnan (d, ix);
+  uint32x4_t yspecial = v_zeroinfnan (d, iy);
+  uint32x4_t cmp = vorrq_u32 (xspecial, yspecial);
+
+  /* Evaluate pow(x, y) for x containing negative or subnormal lanes.  */
+  uint32x4_t x_is_neg_or_sub = vcltq_f32 (x, v_f32 (0x1p-126f));
+  if (__glibc_unlikely (v_any_u32 (x_is_neg_or_sub)))
+    {
+      float32x4_t ret = v_powf_x_is_neg_or_small (x, y, d);
+      if (__glibc_unlikely (v_any_u32 (cmp)))
+        return special_case (x, y, ret, cmp);
+      return ret;
+    }
+
+  /* Else evaluate pow(x, y) for normal and positive x only.
+     Use the powrf helper routine.  */
   if (__glibc_unlikely (v_any_u32 (cmp)))
-    return special_case (x, y, vcombine_f32 (p_lo, p_hi), cmp);
-  return vcombine_f32 (p_lo, p_hi);
+    return special_case (x, y, v_powrf_core (x, y, d), cmp);
+  return v_powrf_core (x, y, d);
 }
 libmvec_hidden_def (V_NAME_F2 (pow))
 HALF_WIDTH_ALIAS_F2(pow)
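For readers unfamiliar with the powf kernels: they compute x^y as
2^(y*log2(x)), splitting y*log2(x) into an integer multiple of 1/N plus a
small remainder r, looking 2^(k/N) up in a table and approximating 2^r
with a short polynomial. A toy scalar version (not part of the patch; N
and the polynomial are illustrative, not the patch's constants, and exp2
stands in for the real sbits table lookup):

#include <math.h>
#include <stdio.h>

#define N 32

int
main (void)
{
  double t = 5.3; /* Stand-in for y * log2 (x).  */

  double kd = round (t * N);
  long k = (long) kd;
  double r = t - kd / N; /* |r| <= 1/(2N).  */

  /* 2^(k/N): an exact call replaces the table of the real code.  */
  double scale = exp2 ((double) k / N);

  /* Cubic Taylor polynomial for 2^r on [-1/(2N), 1/(2N)].  */
  double ln2 = 0x1.62e42fefa39efp-1;
  double p = 1 + r * ln2 * (1 + r * ln2 / 2 * (1 + r * ln2 / 3));

  printf ("approx = %a\nexact  = %a\n", scale * p, exp2 (t));
  return 0;
}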
diff --git a/sysdeps/aarch64/fpu/powf_sve.c b/sysdeps/aarch64/fpu/powf_sve.c
index cbe2044926..b3579e3461 100644
--- a/sysdeps/aarch64/fpu/powf_sve.c
+++ b/sysdeps/aarch64/fpu/powf_sve.c
@@ -17,100 +17,11 @@
    License along with the GNU C Library; if not, see
    <https://www.gnu.org/licenses/>.  */
 
-#include "../ieee754/flt-32/math_config.h"
+#include "flt-32/math_config.h"
 #include "sv_math.h"
 
-/* The following data is used in the SVE pow core computation
-   and special case detection.  */
-#define Tinvc __v_powf_data.invc
-#define Tlogc __v_powf_data.logc
-#define Texp __v_powf_data.scale
-#define SignBias (1 << (V_POWF_EXP2_TABLE_BITS + 11))
-#define Norm 0x1p23f /* 0x4b000000.  */
-
-/* Overall ULP error bound for pow is 2.6 ulp
-   ~ 0.5 + 2^24 (128*Ln2*relerr_log2 + relerr_exp2).  */
-static const struct data
-{
-  double log_poly[4];
-  double exp_poly[3];
-  float uflow_bound, oflow_bound, small_bound;
-  uint32_t sign_bias, subnormal_bias, off;
-} data = {
-  /* rel err: 1.5 * 2^-30.  Each coefficients is multiplied the value of
-     V_POWF_EXP2_N.  */
-  .log_poly = { -0x1.6ff5daa3b3d7cp+3, 0x1.ec81d03c01aebp+3,
-                -0x1.71547bb43f101p+4, 0x1.7154764a815cbp+5 },
-  /* rel err: 1.69 * 2^-34.  */
-  .exp_poly = {
-    0x1.c6af84b912394p-20, /* A0 / V_POWF_EXP2_N^3.  */
-    0x1.ebfce50fac4f3p-13, /* A1 / V_POWF_EXP2_N^2.  */
-    0x1.62e42ff0c52d6p-6,  /* A3 / V_POWF_EXP2_N.  */
-  },
-  .uflow_bound = -0x1.2cp+12f, /* -150.0 * V_POWF_EXP2_N.  */
-  .oflow_bound = 0x1p+12f, /* 128.0 * V_POWF_EXP2_N.  */
-  .small_bound = 0x1p-126f,
-  .off = 0x3f35d000,
-  .sign_bias = SignBias,
-  .subnormal_bias = 0x0b800000, /* 23 << 23.  */
-};
-
-#define A(i) sv_f64 (d->log_poly[i])
-#define C(i) sv_f64 (d->exp_poly[i])
-
-/* Check if x is an integer.  */
-static inline svbool_t
-svisint (svbool_t pg, svfloat32_t x)
-{
-  return svcmpeq (pg, svrintz_z (pg, x), x);
-}
-
-/* Check if x is real not integer valued.  */
-static inline svbool_t
-svisnotint (svbool_t pg, svfloat32_t x)
-{
-  return svcmpne (pg, svrintz_z (pg, x), x);
-}
-
-/* Check if x is an odd integer.  */
-static inline svbool_t
-svisodd (svbool_t pg, svfloat32_t x)
-{
-  svfloat32_t y = svmul_x (pg, x, 0.5f);
-  return svisnotint (pg, y);
-}
-
-/* Check if zero, inf or nan.  */
-static inline svbool_t
-sv_zeroinfnan (svbool_t pg, svuint32_t i)
-{
-  return svcmpge (pg, svsub_x (pg, svadd_x (pg, i, i), 1),
-                  2u * 0x7f800000 - 1);
-}
-
-/* Returns 0 if not int, 1 if odd int, 2 if even int.  The argument is
-   the bit representation of a non-zero finite floating-point value.  */
-static inline int
-checkint (uint32_t iy)
-{
-  int e = iy >> 23 & 0xff;
-  if (e < 0x7f)
-    return 0;
-  if (e > 0x7f + 23)
-    return 2;
-  if (iy & ((1 << (0x7f + 23 - e)) - 1))
-    return 0;
-  if (iy & (1 << (0x7f + 23 - e)))
-    return 1;
-  return 2;
-}
-
-/* Check if zero, inf or nan.  */
-static inline int
-zeroinfnan (uint32_t ix)
-{
-  return 2 * ix - 1 >= 2u * 0x7f800000 - 1;
-}
+#define WANT_SV_POWF_SIGN_BIAS 1
+#include "sv_powf_inline.h"
 
 /* A scalar subroutine used to fix main power special cases.  Similar to the
    preamble of scalar powf except that we do not update ix and sign_bias.  This
    is done in the preamble of the SVE powf.  */
@@ -152,91 +63,6 @@ sv_call_powf_sc (svfloat32_t x1, svfloat32_t x2, svfloat32_t y, svbool_t cmp)
   return sv_call2_f32 (powf_specialcase, x1, x2, y, cmp);
 }
 
-/* Compute core for half of the lanes in double precision.  */
-static inline svfloat64_t
-sv_powf_core_ext (const svbool_t pg, svuint64_t i, svfloat64_t z, svint64_t k,
-                  svfloat64_t y, svuint64_t sign_bias, svfloat64_t *pylogx,
-                  const struct data *d)
-{
-  svfloat64_t invc = svld1_gather_index (pg, Tinvc, i);
-  svfloat64_t logc = svld1_gather_index (pg, Tlogc, i);
-
-  /* log2(x) = log1p(z/c-1)/ln2 + log2(c) + k.  */
-  svfloat64_t r = svmla_x (pg, sv_f64 (-1.0), z, invc);
-  svfloat64_t y0 = svadd_x (pg, logc, svcvt_f64_x (pg, k));
-
-  /* Polynomial to approximate log1p(r)/ln2.  */
-  svfloat64_t logx = A (0);
-  logx = svmad_x (pg, r, logx, A (1));
-  logx = svmad_x (pg, r, logx, A (2));
-  logx = svmad_x (pg, r, logx, A (3));
-  logx = svmad_x (pg, r, logx, y0);
-  *pylogx = svmul_x (pg, y, logx);
-
-  /* z - kd is in [-1, 1] in non-nearest rounding modes.  */
-  svfloat64_t kd = svrinta_x (svptrue_b64 (), *pylogx);
-  svuint64_t ki = svreinterpret_u64 (svcvt_s64_x (svptrue_b64 (), kd));
-
-  r = svsub_x (pg, *pylogx, kd);
-
-  /* exp2(x) = 2^(k/N) * 2^r ~= s * (C0*r^3 + C1*r^2 + C2*r + 1).  */
-  svuint64_t t = svld1_gather_index (
-      svptrue_b64 (), Texp, svand_x (svptrue_b64 (), ki, V_POWF_EXP2_N - 1));
-  svuint64_t ski = svadd_x (svptrue_b64 (), ki, sign_bias);
-  t = svadd_x (svptrue_b64 (), t,
-               svlsl_x (svptrue_b64 (), ski, 52 - V_POWF_EXP2_TABLE_BITS));
-  svfloat64_t s = svreinterpret_f64 (t);
-
-  svfloat64_t p = C (0);
-  p = svmla_x (pg, C (1), p, r);
-  p = svmla_x (pg, C (2), p, r);
-  p = svmla_x (pg, s, p, svmul_x (svptrue_b64 (), s, r));
-
-  return p;
-}
-
-/* Widen vector to double precision and compute core on both halves of the
-   vector.  Lower cost of promotion by considering all lanes active.  */
-static inline svfloat32_t
-sv_powf_core (const svbool_t pg, svuint32_t i, svuint32_t iz, svint32_t k,
-              svfloat32_t y, svuint32_t sign_bias, svfloat32_t *pylogx,
-              const struct data *d)
-{
-  const svbool_t ptrue = svptrue_b64 ();
-
-  /* Unpack and promote input vectors (pg, y, z, i, k and sign_bias) into two
-     in order to perform core computation in double precision.  */
-  const svbool_t pg_lo = svunpklo (pg);
-  const svbool_t pg_hi = svunpkhi (pg);
-  svfloat64_t y_lo = svcvt_f64_x (
-      ptrue, svreinterpret_f32 (svunpklo (svreinterpret_u32 (y))));
-  svfloat64_t y_hi = svcvt_f64_x (
-      ptrue, svreinterpret_f32 (svunpkhi (svreinterpret_u32 (y))));
-  svfloat64_t z_lo = svcvt_f64_x (ptrue, svreinterpret_f32 (svunpklo (iz)));
-  svfloat64_t z_hi = svcvt_f64_x (ptrue, svreinterpret_f32 (svunpkhi (iz)));
-  svuint64_t i_lo = svunpklo (i);
-  svuint64_t i_hi = svunpkhi (i);
-  svint64_t k_lo = svunpklo (k);
-  svint64_t k_hi = svunpkhi (k);
-  svuint64_t sign_bias_lo = svunpklo (sign_bias);
-  svuint64_t sign_bias_hi = svunpkhi (sign_bias);
-
-  /* Compute each part in double precision.  */
-  svfloat64_t ylogx_lo, ylogx_hi;
-  svfloat64_t lo = sv_powf_core_ext (pg_lo, i_lo, z_lo, k_lo, y_lo,
-                                     sign_bias_lo, &ylogx_lo, d);
-  svfloat64_t hi = sv_powf_core_ext (pg_hi, i_hi, z_hi, k_hi, y_hi,
-                                     sign_bias_hi, &ylogx_hi, d);
-
-  /* Convert back to single-precision and interleave.  */
-  svfloat32_t ylogx_lo_32 = svcvt_f32_x (ptrue, ylogx_lo);
-  svfloat32_t ylogx_hi_32 = svcvt_f32_x (ptrue, ylogx_hi);
-  *pylogx = svuzp1 (ylogx_lo_32, ylogx_hi_32);
-  svfloat32_t lo_32 = svcvt_f32_x (ptrue, lo);
-  svfloat32_t hi_32 = svcvt_f32_x (ptrue, hi);
-  return svuzp1 (lo_32, hi_32);
-}
-
 /* Implementation of SVE powf.
    Provides the same accuracy as AdvSIMD powf, since it relies on the same
    algorithm.  The theoretical maximum error is under 2.60 ULPs.
@@ -273,15 +99,14 @@ svfloat32_t SV_NAME_F2 (pow) (svfloat32_t x, svfloat32_t y, const svbool_t pg)
   svbool_t yspecial = sv_zeroinfnan (pg, viy0);
   svbool_t cmp = svorr_z (pg, xspecial, yspecial);
 
-  /* Small cases of x: |x| < 0x1p-126.  */
-  svbool_t xsmall = svaclt (yint_or_xpos, x, d->small_bound);
-  if (__glibc_unlikely (svptest_any (yint_or_xpos, xsmall)))
+  /* Cases of subnormal x: |x| < 0x1p-126.  */
+  svbool_t x_is_subnormal = svaclt (yint_or_xpos, x, d->small_bound);
+  if (__glibc_unlikely (svptest_any (yint_or_xpos, x_is_subnormal)))
    {
      /* Normalize subnormal x so exponent becomes negative.  */
-      svuint32_t vix_norm = svreinterpret_u32 (svmul_x (xsmall, x, Norm));
-      vix_norm = svand_x (xsmall, vix_norm, 0x7fffffff);
-      vix_norm = svsub_x (xsmall, vix_norm, d->subnormal_bias);
-      vix = svsel (xsmall, vix_norm, vix);
+      vix = svreinterpret_u32 (svmul_m (x_is_subnormal, x, 0x1p23f));
+      vix = svand_m (x_is_subnormal, vix, 0x7fffffff);
+      vix = svsub_m (x_is_subnormal, vix, d->subnormal_bias);
    }
 
   /* Part of core computation carried in working precision.  */
   svuint32_t tmp = svsub_x (yint_or_xpos, vix, d->off);
diff --git a/sysdeps/aarch64/fpu/sv_pow_inline.h b/sysdeps/aarch64/fpu/sv_pow_inline.h
new file mode 100644
index 0000000000..5a96e95bdf
--- /dev/null
+++ b/sysdeps/aarch64/fpu/sv_pow_inline.h
@@ -0,0 +1,304 @@
+/* Helper for SVE double-precision pow
+
+   Copyright (C) 2025 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <https://www.gnu.org/licenses/>.  */
+
+#ifndef WANT_SV_POW_SIGN_BIAS
+# error \
+  "Cannot use sv_pow_inline.h without specifying whether you need sign_bias."
+#endif
+
+/* Data is defined in v_pow_log_data.c.  */
+#define N_LOG (1 << V_POW_LOG_TABLE_BITS)
+#define Off 0x3fe6955500000000
+
+/* Data is defined in v_pow_exp_data.c.  */
+#define N_EXP (1 << V_POW_EXP_TABLE_BITS)
+#define SignBias (0x800 << V_POW_EXP_TABLE_BITS)
+#define SmallExp 0x3c9 /* top12(0x1p-54).  */
+#define BigExp 0x408 /* top12(512.).  */
+#define ThresExp 0x03f /* BigExp - SmallExp.  */
+#define HugeExp 0x409 /* top12(1024.).  */
+
+static const struct data
+{
+  double log_c0, log_c2, log_c4, log_c6, ln2_hi, ln2_lo;
+  double log_c1, log_c3, log_c5, off;
+  double n_over_ln2, exp_c2, ln2_over_n_hi, ln2_over_n_lo;
+  double exp_c0, exp_c1;
+} data = {
+  .log_c0 = -0x1p-1,
+  .log_c1 = -0x1.555555555556p-1,
+  .log_c2 = 0x1.0000000000006p-1,
+  .log_c3 = 0x1.999999959554ep-1,
+  .log_c4 = -0x1.555555529a47ap-1,
+  .log_c5 = -0x1.2495b9b4845e9p0,
+  .log_c6 = 0x1.0002b8b263fc3p0,
+  .off = Off,
+  .exp_c0 = 0x1.fffffffffffd4p-2,
+  .exp_c1 = 0x1.5555571d6ef9p-3,
+  .exp_c2 = 0x1.5555576a5adcep-5,
+  .ln2_hi = 0x1.62e42fefa3800p-1,
+  .ln2_lo = 0x1.ef35793c76730p-45,
+  .n_over_ln2 = 0x1.71547652b82fep0 * N_EXP,
+  .ln2_over_n_hi = 0x1.62e42fefc0000p-9,
+  .ln2_over_n_lo = -0x1.c610ca86c3899p-45,
+};
+
+/* Check if x is an integer.  */
+static inline svbool_t
+sv_isint (svbool_t pg, svfloat64_t x)
+{
+  return svcmpeq (pg, svrintz_z (pg, x), x);
+}
+
+/* Check if x is real but not integer valued.  */
+static inline svbool_t
+sv_isnotint (svbool_t pg, svfloat64_t x)
+{
+  return svcmpne (pg, svrintz_z (pg, x), x);
+}
+
+/* Check if x is an odd integer.  */
+static inline svbool_t
+sv_isodd (svbool_t pg, svfloat64_t x)
+{
+  svfloat64_t y = svmul_x (svptrue_b64 (), x, 0.5);
+  return sv_isnotint (pg, y);
+}
+
+/* Returns 0 if not int, 1 if odd int, 2 if even int.  The argument is
+   the bit representation of a non-zero finite floating-point value.  */
+static inline int
+checkint (uint64_t iy)
+{
+  int e = iy >> 52 & 0x7ff;
+  if (e < 0x3ff)
+    return 0;
+  if (e > 0x3ff + 52)
+    return 2;
+  if (iy & ((1ULL << (0x3ff + 52 - e)) - 1))
+    return 0;
+  if (iy & (1ULL << (0x3ff + 52 - e)))
+    return 1;
+  return 2;
+}
+
+/* Top 12 bits (sign and exponent) of each double-precision lane.  */
+static inline svuint64_t
+sv_top12 (svfloat64_t x)
+{
+  return svlsr_x (svptrue_b64 (), svreinterpret_u64 (x), 52);
+}
+
+/* Returns 1 if the input is the bit representation of 0, infinity or nan.  */
+static inline int
+zeroinfnan (uint64_t i)
+{
+  return 2 * i - 1 >= 2 * asuint64 (INFINITY) - 1;
+}
+
+/* Vector version of zeroinfnan: true for lanes whose input is the bit
+   representation of 0, infinity or nan.
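+   Doubling the bit pattern shifts the sign bit out, so +0.0 and -0.0 both
+   wrap to UINT64_MAX after the subtraction of 1, while any finite non-zero
+   value stays below 2 * asuint64 (INFINITY) - 1.  */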
+static inline svbool_t
+sv_zeroinfnan (svbool_t pg, svuint64_t i)
+{
+  return svcmpge (pg, svsub_x (pg, svadd_x (pg, i, i), 1),
+		  2 * asuint64 (INFINITY) - 1);
+}
+
+/* Handle cases that may overflow or underflow when computing the result
+   scale*(1+TMP) without intermediate rounding.  The bit representation of
+   scale is in SBITS; however, its computed exponent may have overflowed
+   into the sign bit, so it needs to be adjusted before use as a double.
+   (int32_t)KI is the k used in the argument reduction and exponent
+   adjustment of scale; positive k means the result may overflow and
+   negative k means the result may underflow.  */
+static inline svfloat64_t
+specialcase (svfloat64_t tmp, svuint64_t sbits, svuint64_t ki, svbool_t cmp)
+{
+  svbool_t p_pos = svcmpge_n_f64 (cmp, svreinterpret_f64_u64 (ki), 0.0);
+
+  /* Scale up or down depending on sign of k.  */
+  svint64_t offset
+      = svsel_s64 (p_pos, sv_s64 (1009ull << 52), sv_s64 (-1022ull << 52));
+  svfloat64_t factor
+      = svsel_f64 (p_pos, sv_f64 (0x1p1009), sv_f64 (0x1p-1022));
+
+  svuint64_t offset_sbits
+      = svsub_u64_x (cmp, sbits, svreinterpret_u64_s64 (offset));
+  svfloat64_t scale = svreinterpret_f64_u64 (offset_sbits);
+  svfloat64_t res = svmad_f64_x (cmp, scale, tmp, scale);
+  return svmul_f64_x (cmp, res, factor);
+}
+
+/* Compute y+TAIL = log(x) where the rounded result is y and TAIL has about
+   additional 15 bits precision.  IX is the bit representation of x, but
+   normalized in the subnormal range using the sign bit for the exponent.  */
+static inline svfloat64_t
+sv_log_inline (svbool_t pg, svuint64_t ix, svfloat64_t *tail,
+	       const struct data *d)
+{
+  /* x = 2^k z; where z is in range [Off,2*Off) and exact.
+     The range is split into N subintervals.
+     The ith subinterval contains z and c is near its center.  */
+  svuint64_t tmp = svsub_x (pg, ix, d->off);
+  svuint64_t i = svand_x (pg, svlsr_x (pg, tmp, 52 - V_POW_LOG_TABLE_BITS),
+			  sv_u64 (N_LOG - 1));
+  svint64_t k = svasr_x (pg, svreinterpret_s64 (tmp), 52);
+  svuint64_t iz = svsub_x (pg, ix, svlsl_x (pg, svreinterpret_u64 (k), 52));
+  svfloat64_t z = svreinterpret_f64 (iz);
+  svfloat64_t kd = svcvt_f64_x (pg, k);
+
+  /* log(x) = k*Ln2 + log(c) + log1p(z/c-1).  */
+  /* SVE lookup requires 3 separate lookup tables, as opposed to the scalar
+     version, which uses an array of structures.  We also do the lookup
+     earlier in the code to make sure it finishes as early as possible.  */
+  svfloat64_t invc = svld1_gather_index (pg, __v_pow_log_data.invc, i);
+  svfloat64_t logc = svld1_gather_index (pg, __v_pow_log_data.logc, i);
+  svfloat64_t logctail = svld1_gather_index (pg, __v_pow_log_data.logctail, i);
+
+  /* Note: 1/c is j/N or j/N/2 where j is an integer in [N,2N) and
+     |z/c - 1| < 1/N, so r = z/c - 1 is exactly representable.  */
+  svfloat64_t r = svmad_x (pg, z, invc, -1.0);
+  /* k*Ln2 + log(c) + r.  */
+
+  svfloat64_t ln2_hilo = svld1rq_f64 (svptrue_b64 (), &d->ln2_hi);
+  svfloat64_t t1 = svmla_lane_f64 (logc, kd, ln2_hilo, 0);
+  svfloat64_t t2 = svadd_x (pg, t1, r);
+  svfloat64_t lo1 = svmla_lane_f64 (logctail, kd, ln2_hilo, 1);
+  svfloat64_t lo2 = svadd_x (pg, svsub_x (pg, t1, t2), r);
+
+  /* Evaluation is optimized assuming superscalar pipelined execution.  */
+
+  svfloat64_t log_c02 = svld1rq_f64 (svptrue_b64 (), &d->log_c0);
+  svfloat64_t ar = svmul_lane_f64 (r, log_c02, 0);
+  svfloat64_t ar2 = svmul_x (svptrue_b64 (), r, ar);
+  svfloat64_t ar3 = svmul_x (svptrue_b64 (), r, ar2);
+  /* k*Ln2 + log(c) + r + A[0]*r*r.
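+     hi below is the rounded head of that sum; lo3 = ar2 - ar*r captures the
+     rounding error of ar2 exactly and is subtracted back out, while lo4
+     recovers what the t2 + ar2 addition dropped, keeping ~15 extra bits in
+     the tail.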
*/ + svfloat64_t hi = svadd_x (pg, t2, ar2); + svfloat64_t lo3 = svmls_x (pg, ar2, ar, r); + svfloat64_t lo4 = svadd_x (pg, svsub_x (pg, t2, hi), ar2); + /* p = log1p(r) - r - A[0]*r*r. */ + /* p = (ar3 * (A[1] + r * A[2] + ar2 * (A[3] + r * A[4] + ar2 * (A[5] + r * + A[6])))). */ + + svfloat64_t log_c46 = svld1rq_f64 (svptrue_b64 (), &d->log_c4); + svfloat64_t a56 = svmla_lane_f64 (sv_f64 (d->log_c5), r, log_c46, 1); + svfloat64_t a34 = svmla_lane_f64 (sv_f64 (d->log_c3), r, log_c46, 0); + svfloat64_t a12 = svmla_lane_f64 (sv_f64 (d->log_c1), r, log_c02, 1); + svfloat64_t p = svmla_x (pg, a34, ar2, a56); + p = svmla_x (pg, a12, ar2, p); + p = svmul_x (svptrue_b64 (), ar3, p); + svfloat64_t lo = svadd_x ( + pg, svadd_x (pg, svsub_x (pg, svadd_x (pg, lo1, lo2), lo3), lo4), p); + svfloat64_t y = svadd_x (pg, hi, lo); + *tail = svadd_x (pg, svsub_x (pg, hi, y), lo); + return y; +} + +static inline svfloat64_t +sv_exp_core (svbool_t pg, svfloat64_t x, svfloat64_t xtail, + svuint64_t sign_bias, svfloat64_t *tmp, svuint64_t *sbits, + svuint64_t *ki, const struct data *d) +{ + /* exp(x) = 2^(k/N) * exp(r), with exp(r) in [2^(-1/2N),2^(1/2N)]. */ + /* x = ln2/N*k + r, with int k and r in [-ln2/2N, ln2/2N]. */ + svfloat64_t n_over_ln2_and_c2 = svld1rq_f64 (svptrue_b64 (), &d->n_over_ln2); + svfloat64_t z = svmul_lane_f64 (x, n_over_ln2_and_c2, 0); + /* z - kd is in [-1, 1] in non-nearest rounding modes. */ + svfloat64_t kd = svrinta_x (pg, z); + *ki = svreinterpret_u64 (svcvt_s64_x (pg, kd)); + + svfloat64_t ln2_over_n_hilo + = svld1rq_f64 (svptrue_b64 (), &d->ln2_over_n_hi); + svfloat64_t r = x; + r = svmls_lane_f64 (r, kd, ln2_over_n_hilo, 0); + r = svmls_lane_f64 (r, kd, ln2_over_n_hilo, 1); + /* The code assumes 2^-200 < |xtail| < 2^-8/N. */ + r = svadd_x (pg, r, xtail); + /* 2^(k/N) ~= scale. */ + svuint64_t idx = svand_x (pg, *ki, N_EXP - 1); + svuint64_t top + = svlsl_x (pg, svadd_x (pg, *ki, sign_bias), 52 - V_POW_EXP_TABLE_BITS); + /* This is only a valid scale when -1023*N < k < 1024*N. */ + *sbits = svld1_gather_index (pg, __v_pow_exp_data.sbits, idx); + *sbits = svadd_x (pg, *sbits, top); + /* exp(x) = 2^(k/N) * exp(r) ~= scale + scale * (exp(r) - 1). */ + svfloat64_t r2 = svmul_x (svptrue_b64 (), r, r); + *tmp = svmla_lane_f64 (sv_f64 (d->exp_c1), r, n_over_ln2_and_c2, 1); + *tmp = svmla_x (pg, sv_f64 (d->exp_c0), r, *tmp); + *tmp = svmla_x (pg, r, r2, *tmp); + svfloat64_t scale = svreinterpret_f64 (*sbits); + /* Note: tmp == 0 or |tmp| > 2^-200 and scale > 2^-739, so there + is no spurious underflow here even without fma. */ + z = svmla_x (pg, scale, scale, *tmp); + return z; +} + +/* Computes sign*exp(x+xtail) where |xtail| < 2^-8/N and |xtail| <= |x|. + The sign_bias argument is SignBias or 0 and sets the sign to -1 or 1. */ +static inline svfloat64_t +sv_exp_inline (svbool_t pg, svfloat64_t x, svfloat64_t xtail, + svuint64_t sign_bias, const struct data *d) +{ + /* 3 types of special cases: tiny (uflow and spurious uflow), huge (oflow) + and other cases of large values of x (scale * (1 + TMP) oflow). */ + svuint64_t abstop = svand_x (pg, sv_top12 (x), 0x7ff); + /* |x| is large (|x| >= 512) or tiny (|x| <= 0x1p-54). */ + svbool_t uoflow = svcmpge (pg, svsub_x (pg, abstop, SmallExp), ThresExp); + + svfloat64_t tmp; + svuint64_t sbits, ki; + if (__glibc_unlikely (svptest_any (pg, uoflow))) + { + svfloat64_t z + = sv_exp_core (pg, x, xtail, sign_bias, &tmp, &sbits, &ki, d); + + /* |x| is tiny (|x| <= 0x1p-54). 
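+	 abstop - SmallExp wraps for those lanes, so the unsigned comparison
+	 against 0x80000000 selects exactly the tiny inputs, whose result
+	 collapses to a signed 1.0 (res_spurious_uflow below).  */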
+      svbool_t uflow
+	  = svcmpge (pg, svsub_x (pg, abstop, SmallExp), 0x80000000);
+      uflow = svand_z (pg, uoflow, uflow);
+      /* |x| is huge (|x| >= 1024).  */
+      svbool_t oflow = svcmpge (pg, abstop, HugeExp);
+      oflow = svand_z (pg, uoflow, svbic_z (pg, oflow, uflow));
+
+      /* Handle underflow and overflow in scale.
+	 For large |x| values (512 < |x| < 1024), scale * (1 + TMP) can
+	 overflow or underflow.  */
+      svbool_t special = svbic_z (pg, uoflow, svorr_z (pg, uflow, oflow));
+      if (__glibc_unlikely (svptest_any (pg, special)))
+	z = svsel (special, specialcase (tmp, sbits, ki, special), z);
+
+      /* Handle underflow and overflow in exp.  */
+      svbool_t x_is_neg = svcmplt (pg, x, 0);
+      svuint64_t sign_mask
+	  = svlsl_x (pg, sign_bias, 52 - V_POW_EXP_TABLE_BITS);
+      svfloat64_t res_uoflow
+	  = svsel (x_is_neg, sv_f64 (0.0), sv_f64 (INFINITY));
+      res_uoflow = svreinterpret_f64 (
+	  svorr_x (pg, svreinterpret_u64 (res_uoflow), sign_mask));
+      /* Avoid spurious underflow for tiny x.  */
+      svfloat64_t res_spurious_uflow
+	  = svreinterpret_f64 (svorr_x (pg, sign_mask, 0x3ff0000000000000));
+
+      z = svsel (oflow, res_uoflow, z);
+      z = svsel (uflow, res_spurious_uflow, z);
+      return z;
+    }
+
+  return sv_exp_core (pg, x, xtail, sign_bias, &tmp, &sbits, &ki, d);
+}
diff --git a/sysdeps/aarch64/fpu/sv_powf_inline.h b/sysdeps/aarch64/fpu/sv_powf_inline.h
new file mode 100644
index 0000000000..4ba222bedf
--- /dev/null
+++ b/sysdeps/aarch64/fpu/sv_powf_inline.h
@@ -0,0 +1,211 @@
+/* Helper for SVE single-precision pow
+
+   Copyright (C) 2025 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <https://www.gnu.org/licenses/>.  */
+
+#ifndef WANT_SV_POWF_SIGN_BIAS
+# error \
+  "Cannot use sv_powf_inline.h without specifying whether you need sign_bias."
+#endif
+
+/* The following data is used in the SVE pow core computation
+   and special case detection.  */
+#define Tinvc __v_powf_data.invc
+#define Tlogc __v_powf_data.logc
+#define Texp __v_powf_data.scale
+#define SignBias (1 << (V_POWF_EXP2_TABLE_BITS + 11))
+static const struct data
+{
+  double log_poly[4];
+  double exp_poly[3];
+  float uflow_bound, oflow_bound, small_bound;
+  uint32_t sign_bias, subnormal_bias, off;
+} data = {
+  /* rel err: 1.5 * 2^-30.  Each coefficient is multiplied by the value of
+     V_POWF_EXP2_N.  */
+  .log_poly = { -0x1.6ff5daa3b3d7cp+3, 0x1.ec81d03c01aebp+3,
+		-0x1.71547bb43f101p+4, 0x1.7154764a815cbp+5 },
+  /* rel err: 1.69 * 2^-34.  */
+  .exp_poly = {
+    0x1.c6af84b912394p-20, /* A0 / V_POWF_EXP2_N^3.  */
+    0x1.ebfce50fac4f3p-13, /* A1 / V_POWF_EXP2_N^2.  */
+    0x1.62e42ff0c52d6p-6, /* A3 / V_POWF_EXP2_N.  */
+  },
+  .uflow_bound = -0x1.2cp+12f, /* -150.0 * V_POWF_EXP2_N.  */
+  .oflow_bound = 0x1p+12f, /* 128.0 * V_POWF_EXP2_N.  */
+  .small_bound = 0x1p-126f,
+  .off = 0x3f35d000,
+  .sign_bias = SignBias,
+  .subnormal_bias = 0x0b800000, /* 23 << 23.  */
+};
+
+/* Check if x is an integer.
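+   svrintz truncates toward zero, so comparing the truncation with x is
+   true exactly for integral lanes (and trivially for all |x| >= 0x1p23,
+   where every float is an integer).  */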
+static inline svbool_t
+svisint (svbool_t pg, svfloat32_t x)
+{
+  return svcmpeq (pg, svrintz_z (pg, x), x);
+}
+
+/* Check if x is real but not integer valued.  */
+static inline svbool_t
+svisnotint (svbool_t pg, svfloat32_t x)
+{
+  return svcmpne (pg, svrintz_z (pg, x), x);
+}
+
+/* Check if x is an odd integer.  */
+static inline svbool_t
+svisodd (svbool_t pg, svfloat32_t x)
+{
+  svfloat32_t y = svmul_x (pg, x, 0.5f);
+  return svisnotint (pg, y);
+}
+
+/* Check if zero, inf or nan.  */
+static inline svbool_t
+sv_zeroinfnan (svbool_t pg, svuint32_t i)
+{
+  return svcmpge (pg, svsub_x (pg, svadd_x (pg, i, i), 1),
+		  2u * 0x7f800000 - 1);
+}
+
+/* Returns 0 if not int, 1 if odd int, 2 if even int.  The argument is
+   the bit representation of a non-zero finite floating-point value.  */
+static inline int
+checkint (uint32_t iy)
+{
+  int e = iy >> 23 & 0xff;
+  if (e < 0x7f)
+    return 0;
+  if (e > 0x7f + 23)
+    return 2;
+  if (iy & ((1 << (0x7f + 23 - e)) - 1))
+    return 0;
+  if (iy & (1 << (0x7f + 23 - e)))
+    return 1;
+  return 2;
+}
+
+/* Check if zero, inf or nan.  */
+static inline int
+zeroinfnan (uint32_t ix)
+{
+  return 2 * ix - 1 >= 2u * 0x7f800000 - 1;
+}
+
+/* Compute core for half of the lanes in double precision.  */
+static inline svfloat64_t
+sv_powf_core_ext (const svbool_t pg, svuint64_t i, svfloat64_t z, svint64_t k,
+#if WANT_SV_POWF_SIGN_BIAS
+		  svfloat64_t y, svuint64_t sign_bias, svfloat64_t *pylogx,
+#else
+		  svfloat64_t y, svfloat64_t *pylogx,
+#endif
+		  const struct data *d)
+{
+  svfloat64_t invc = svld1_gather_index (pg, Tinvc, i);
+  svfloat64_t logc = svld1_gather_index (pg, Tlogc, i);
+
+  /* log2(x) = log1p(z/c-1)/ln2 + log2(c) + k.  */
+  svfloat64_t r = svmla_x (pg, sv_f64 (-1.0), z, invc);
+  svfloat64_t y0 = svadd_x (pg, logc, svcvt_f64_x (pg, k));
+
+  /* Polynomial to approximate log1p(r)/ln2.  */
+  svfloat64_t logx = sv_f64 (d->log_poly[0]);
+  logx = svmad_x (pg, r, logx, sv_f64 (d->log_poly[1]));
+  logx = svmad_x (pg, r, logx, sv_f64 (d->log_poly[2]));
+  logx = svmad_x (pg, r, logx, sv_f64 (d->log_poly[3]));
+  logx = svmad_x (pg, r, logx, y0);
+  *pylogx = svmul_x (pg, y, logx);
+
+  /* z - kd is in [-1, 1] in non-nearest rounding modes.  */
+  svfloat64_t kd = svrinta_x (svptrue_b64 (), *pylogx);
+  svuint64_t ki = svreinterpret_u64 (svcvt_s64_x (svptrue_b64 (), kd));
+
+  r = svsub_x (pg, *pylogx, kd);
+
+  /* exp2(x) = 2^(k/N) * 2^r ~= s * (C0*r^3 + C1*r^2 + C2*r + 1).  */
+  svuint64_t t = svld1_gather_index (
+      svptrue_b64 (), Texp, svand_x (svptrue_b64 (), ki, V_POWF_EXP2_N - 1));
+#if WANT_SV_POWF_SIGN_BIAS
+  svuint64_t ski = svadd_x (svptrue_b64 (), ki, sign_bias);
+  t = svadd_x (svptrue_b64 (), t,
+	       svlsl_x (svptrue_b64 (), ski, 52 - V_POWF_EXP2_TABLE_BITS));
+#else
+  t = svadd_x (svptrue_b64 (), t,
+	       svlsl_x (svptrue_b64 (), ki, 52 - V_POWF_EXP2_TABLE_BITS));
+#endif
+  svfloat64_t s = svreinterpret_f64 (t);
+
+  svfloat64_t p = sv_f64 (d->exp_poly[0]);
+  p = svmla_x (pg, sv_f64 (d->exp_poly[1]), p, r);
+  p = svmla_x (pg, sv_f64 (d->exp_poly[2]), p, r);
+  p = svmla_x (pg, s, p, svmul_x (svptrue_b64 (), s, r));
+
+  return p;
+}
+
+/* Widen vector to double precision and compute core on both halves of the
+   vector.  Lower the cost of promotion by considering all lanes active.
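+   Every float32 lane converts exactly to float64, so the promotion itself
+   introduces no rounding; the double-precision core is what keeps
+   y * log2(x) accurate enough for the 2.6 ULP bound quoted for powf.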
*/ +static inline svfloat32_t +sv_powf_core (const svbool_t pg, svuint32_t i, svuint32_t iz, svint32_t k, + svfloat32_t y, svuint32_t sign_bias, svfloat32_t *pylogx, + const struct data *d) +{ + const svbool_t ptrue = svptrue_b64 (); + + /* Unpack and promote input vectors (pg, y, z, i, k and sign_bias) into two + in order to perform core computation in double precision. */ + const svbool_t pg_lo = svunpklo (pg); + const svbool_t pg_hi = svunpkhi (pg); + svfloat64_t y_lo = svcvt_f64_x ( + ptrue, svreinterpret_f32 (svunpklo (svreinterpret_u32 (y)))); + svfloat64_t y_hi = svcvt_f64_x ( + ptrue, svreinterpret_f32 (svunpkhi (svreinterpret_u32 (y)))); + svfloat64_t z_lo = svcvt_f64_x (ptrue, svreinterpret_f32 (svunpklo (iz))); + svfloat64_t z_hi = svcvt_f64_x (ptrue, svreinterpret_f32 (svunpkhi (iz))); + svuint64_t i_lo = svunpklo (i); + svuint64_t i_hi = svunpkhi (i); + svint64_t k_lo = svunpklo (k); + svint64_t k_hi = svunpkhi (k); +#if WANT_SV_POWF_SIGN_BIAS + svuint64_t sign_bias_lo = svunpklo (sign_bias); + svuint64_t sign_bias_hi = svunpkhi (sign_bias); +#endif + + /* Compute each part in double precision. */ + svfloat64_t ylogx_lo, ylogx_hi; +#if WANT_SV_POWF_SIGN_BIAS + svfloat64_t lo = sv_powf_core_ext (pg_lo, i_lo, z_lo, k_lo, y_lo, + sign_bias_lo, &ylogx_lo, d); + svfloat64_t hi = sv_powf_core_ext (pg_hi, i_hi, z_hi, k_hi, y_hi, + sign_bias_hi, &ylogx_hi, d); +#else + svfloat64_t lo + = sv_powf_core_ext (pg_lo, i_lo, z_lo, k_lo, y_lo, &ylogx_lo, d); + svfloat64_t hi + = sv_powf_core_ext (pg_hi, i_hi, z_hi, k_hi, y_hi, &ylogx_hi, d); +#endif + + /* Convert back to single-precision and interleave. */ + svfloat32_t ylogx_lo_32 = svcvt_f32_x (ptrue, ylogx_lo); + svfloat32_t ylogx_hi_32 = svcvt_f32_x (ptrue, ylogx_hi); + *pylogx = svuzp1 (ylogx_lo_32, ylogx_hi_32); + svfloat32_t lo_32 = svcvt_f32_x (ptrue, lo); + svfloat32_t hi_32 = svcvt_f32_x (ptrue, hi); + return svuzp1 (lo_32, hi_32); +} diff --git a/sysdeps/aarch64/fpu/v_pow_inline.h b/sysdeps/aarch64/fpu/v_pow_inline.h new file mode 100644 index 0000000000..5020fda843 --- /dev/null +++ b/sysdeps/aarch64/fpu/v_pow_inline.h @@ -0,0 +1,193 @@ +/* Helper for AdvSIMD double-precision pow + + Copyright (C) 2025 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + . */ + +/* This header defines parameters of the approximation and scalar fallback. */ +#include "finite_pow.h" + +static const struct data +{ + uint64x2_t inf; + float64x2_t subnormal_bound, subnormal_scale; + uint64x2_t subnormal_bias; + uint64x2_t offset, mask; + float64x2_t log_c0, log_c2, log_c4, log_c5; + double log_c1, log_c3; + double ln2_lo, ln2_hi; + uint64x2_t small_exp, thres_exp; + double ln2_lo_n, ln2_hi_n; + double inv_ln2_n, exp_c2; + float64x2_t exp_c0, exp_c1; +} data = { + /* Power thresholds. */ + .inf = V2 (0x7ff0000000000000), + .subnormal_bound = V2 (0x1p-1022), + /* Subnormal x update. 
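+     Multiplying by 0x1p52 renormalizes a subnormal exactly, and subtracting
+     52 << 52 from the bit pattern rebiases the exponent; the borrow may
+     reach the sign bit, which the log code tolerates since it extracts k
+     with an arithmetic shift.  */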
+  .subnormal_scale = V2 (0x1p52),
+  .subnormal_bias = V2 (52ULL << 52),
+  /* Constants for logarithm.  */
+  .offset = V2 (Off),
+  .mask = V2 (0xfffULL << 52),
+  /* Coefficients copied from v_pow_log_data.c.
+     Relative error: 0x1.11922ap-70 in [-0x1.6bp-8, 0x1.6bp-8].
+     Coefficients are scaled to match the scaling during evaluation.  */
+  .log_c0 = V2 (0x1.555555555556p-2 * -2),
+  .log_c1 = -0x1.0000000000006p-2 * -2,
+  .log_c2 = V2 (0x1.999999959554ep-3 * 4),
+  .log_c3 = -0x1.555555529a47ap-3 * 4,
+  .log_c4 = V2 (0x1.2495b9b4845e9p-3 * -8),
+  .log_c5 = V2 (-0x1.0002b8b263fc3p-3 * -8),
+  .ln2_hi = 0x1.62e42fefa3800p-1,
+  .ln2_lo = 0x1.ef35793c76730p-45,
+  /* Polynomial coefficients: abs error: 1.43*2^-58, ulp error: 0.549
+     (0.550 without fma) if |x| < ln2/512.  */
+  .exp_c0 = V2 (0x1.fffffffffffd4p-2),
+  .exp_c1 = V2 (0x1.5555571d6ef9p-3),
+  .exp_c2 = 0x1.5555576a5adcep-5,
+  .small_exp = V2 (0x3c90000000000000),
+  .thres_exp = V2 (0x03f0000000000000),
+  .inv_ln2_n = 0x1.71547652b82fep8, /* N/ln2.  */
+  .ln2_hi_n = 0x1.62e42fefc0000p-9, /* ln2/N.  */
+  .ln2_lo_n = -0x1.c610ca86c3899p-45,
+};
+
+static inline float64x2_t VPCS_ATTR
+v_masked_lookup_f64 (const double *table, uint64x2_t i)
+{
+  return (float64x2_t){
+    table[(i[0] >> (52 - V_POW_LOG_TABLE_BITS)) & (N_LOG - 1)],
+    table[(i[1] >> (52 - V_POW_LOG_TABLE_BITS)) & (N_LOG - 1)]
+  };
+}
+
+/* Compute y+TAIL = log(x) where the rounded result is y and TAIL has about
+   additional 15 bits precision.  IX is the bit representation of x, but
+   normalized in the subnormal range using the sign bit for the exponent.  */
+static inline float64x2_t VPCS_ATTR
+v_log_inline (uint64x2_t ix, float64x2_t *tail, const struct data *d)
+{
+  /* x = 2^k z; where z is in range [OFF,2*OFF) and exact.
+     The range is split into N subintervals.
+     The ith subinterval contains z and c is near its center.  */
+  uint64x2_t tmp = vsubq_u64 (ix, d->offset);
+  int64x2_t k = vshrq_n_s64 (vreinterpretq_s64_u64 (tmp), 52);
+  uint64x2_t iz = vsubq_u64 (ix, vandq_u64 (tmp, d->mask));
+  float64x2_t z = vreinterpretq_f64_u64 (iz);
+  float64x2_t kd = vcvtq_f64_s64 (k);
+  /* log(x) = k*Ln2 + log(c) + log1p(z/c-1).  */
+  float64x2_t invc = v_masked_lookup_f64 (__v_pow_log_data.invc, tmp);
+  float64x2_t logc = v_masked_lookup_f64 (__v_pow_log_data.logc, tmp);
+  float64x2_t logctail = v_masked_lookup_f64 (__v_pow_log_data.logctail, tmp);
+  /* Note: 1/c is j/N or j/N/2 where j is an integer in [N,2N) and
+     |z/c - 1| < 1/N, so r = z/c - 1 is exactly representable.  */
+  float64x2_t r = vfmaq_f64 (v_f64 (-1.0), z, invc);
+  /* k*Ln2 + log(c) + r.  */
+  float64x2_t ln2 = vld1q_f64 (&d->ln2_lo);
+  float64x2_t t1 = vfmaq_laneq_f64 (logc, kd, ln2, 1);
+  float64x2_t t2 = vaddq_f64 (t1, r);
+  float64x2_t lo1 = vfmaq_laneq_f64 (logctail, kd, ln2, 0);
+  float64x2_t lo2 = vaddq_f64 (vsubq_f64 (t1, t2), r);
+  /* Evaluation is optimized assuming superscalar pipelined execution.  */
+  float64x2_t ar = vmulq_f64 (v_f64 (-0.5), r);
+  float64x2_t ar2 = vmulq_f64 (r, ar);
+  float64x2_t ar3 = vmulq_f64 (r, ar2);
+  /* k*Ln2 + log(c) + r + A[0]*r*r.  */
+  float64x2_t hi = vaddq_f64 (t2, ar2);
+  float64x2_t lo3 = vfmaq_f64 (vnegq_f64 (ar2), ar, r);
+  float64x2_t lo4 = vaddq_f64 (vsubq_f64 (t2, hi), ar2);
+  /* p = log1p(r) - r - A[0]*r*r.
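+     Computed below as ar3 * (a12 + ar2 * (a34 + ar2 * a56)), with the
+     pairwise terms a12, a34 and a56 evaluated in parallel to shorten the
+     dependency chain.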
*/ + float64x2_t odd_coeffs = vld1q_f64 (&d->log_c1); + float64x2_t a56 = vfmaq_f64 (d->log_c4, r, d->log_c5); + float64x2_t a34 = vfmaq_laneq_f64 (d->log_c2, r, odd_coeffs, 1); + float64x2_t a12 = vfmaq_laneq_f64 (d->log_c0, r, odd_coeffs, 0); + float64x2_t p = vfmaq_f64 (a34, ar2, a56); + p = vfmaq_f64 (a12, ar2, p); + p = vmulq_f64 (ar3, p); + float64x2_t lo + = vaddq_f64 (vaddq_f64 (vaddq_f64 (vaddq_f64 (lo1, lo2), lo3), lo4), p); + float64x2_t y = vaddq_f64 (hi, lo); + *tail = vaddq_f64 (vsubq_f64 (hi, y), lo); + return y; +} + +static float64x2_t VPCS_ATTR NOINLINE +exp_special_case (float64x2_t x, float64x2_t xtail) +{ + return (float64x2_t){ exp_inline (x[0], xtail[0], 0), + exp_inline (x[1], xtail[1], 0) }; +} + +/* Computes sign*exp(x+xtail) where |xtail| < 2^-8/N and |xtail| <= |x|. */ +static inline float64x2_t VPCS_ATTR +v_exp_inline (float64x2_t x, float64x2_t neg_xtail, const struct data *d) +{ + /* Fallback to scalar exp_inline for all lanes if any lane + contains value of x s.t. |x| <= 2^-54 or >= 512. */ + uint64x2_t uoflowx = vcgeq_u64 ( + vsubq_u64 (vreinterpretq_u64_f64 (vabsq_f64 (x)), d->small_exp), + d->thres_exp); + if (__glibc_unlikely (v_any_u64 (uoflowx))) + return exp_special_case (x, vnegq_f64 (neg_xtail)); + + /* exp(x) = 2^(k/N) * exp(r), with exp(r) in [2^(-1/2N),2^(1/2N)]. */ + /* x = ln2/N*k + r, with k integer and r in [-ln2/2N, ln2/2N]. */ + /* z - kd is in [-1, 1] in non-nearest rounding modes. */ + float64x2_t exp_consts = vld1q_f64 (&d->inv_ln2_n); + float64x2_t z = vmulq_laneq_f64 (x, exp_consts, 0); + float64x2_t kd = vrndnq_f64 (z); + uint64x2_t ki = vreinterpretq_u64_s64 (vcvtaq_s64_f64 (z)); + float64x2_t ln2_n = vld1q_f64 (&d->ln2_lo_n); + float64x2_t r = vfmsq_laneq_f64 (x, kd, ln2_n, 1); + r = vfmsq_laneq_f64 (r, kd, ln2_n, 0); + /* The code assumes 2^-200 < |xtail| < 2^-8/N. */ + r = vsubq_f64 (r, neg_xtail); + /* 2^(k/N) ~= scale. */ + uint64x2_t idx = vandq_u64 (ki, v_u64 (N_EXP - 1)); + uint64x2_t top = vshlq_n_u64 (ki, 52 - V_POW_EXP_TABLE_BITS); + /* This is only a valid scale when -1023*N < k < 1024*N. */ + uint64x2_t sbits = v_lookup_u64 (SBits, idx); + sbits = vaddq_u64 (sbits, top); + /* exp(x) = 2^(k/N) * exp(r) ~= scale + scale * (exp(r) - 1). */ + float64x2_t r2 = vmulq_f64 (r, r); + float64x2_t tmp = vfmaq_laneq_f64 (d->exp_c1, r, exp_consts, 1); + tmp = vfmaq_f64 (d->exp_c0, r, tmp); + tmp = vfmaq_f64 (r, r2, tmp); + float64x2_t scale = vreinterpretq_f64_u64 (sbits); + /* Note: tmp == 0 or |tmp| > 2^-200 and scale > 2^-739, so there + is no spurious underflow here even without fma. */ + return vfmaq_f64 (scale, scale, tmp); +} + +/* This version of AdvSIMD pow implements an algorithm close to AOR scalar pow + but: + - it does not prevent double-rounding in the exp's specialcase subroutine, + - it does not use a tail in the exponential core computation, + - and pow's exp polynomial order and table bits might differ. */ +static inline float64x2_t VPCS_ATTR +v_pow_inline (float64x2_t x, float64x2_t y, const struct data *d) +{ + /* Vector Log(ix, &lo). */ + float64x2_t vlo; + float64x2_t vhi = v_log_inline (vreinterpretq_u64_f64 (x), &vlo, d); + + /* Vector Exp(y_loghi, y_loglo). 
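+     vehi = y * vhi rounds once; the fused multiply-subtract then recovers
+     that rounding error exactly (vemi = vehi - y * vhi), and folding in
+     y * vlo yields a hi/lo pair for y * log(x) to feed v_exp_inline.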
*/ + float64x2_t vehi = vmulq_f64 (y, vhi); + float64x2_t vemi = vfmsq_f64 (vehi, y, vhi); + float64x2_t neg_velo = vfmsq_f64 (vemi, y, vlo); + return v_exp_inline (vehi, neg_velo, d); +} diff --git a/sysdeps/aarch64/fpu/v_powf_inline.h b/sysdeps/aarch64/fpu/v_powf_inline.h new file mode 100644 index 0000000000..f4c86bf403 --- /dev/null +++ b/sysdeps/aarch64/fpu/v_powf_inline.h @@ -0,0 +1,116 @@ +/* Helper for AdvSIMD single-precision pow + + Copyright (C) 2025 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + . */ + +/* Define data structure and some routines specific to positive x. */ +#include "v_powrf_inline.h" + +static inline float64x2_t +exp2_core_with_sbias (const struct data *d, float64x2_t ylogx, + uint64x2_t sign_bias) +{ + /* N*x = k + r with r in [-1/2, 1/2]. */ + float64x2_t kd = vrndnq_f64 (ylogx); + int64x2_t ki = vcvtaq_s64_f64 (ylogx); + float64x2_t r = vsubq_f64 (ylogx, kd); + + /* exp2(x) = 2^(k/N) * 2^r ~= s * (C0*r^3 + C1*r^2 + C2*r + 1). */ + uint64x2_t t = vcombine_u64 (exp2_lookup (d, vgetq_lane_s64 (ki, 0)), + exp2_lookup (d, vgetq_lane_s64 (ki, 1))); + int64x2_t ski = vaddq_s64 (ki, vreinterpretq_s64_u64 (sign_bias)); + t = vaddq_u64 (t, vreinterpretq_u64_s64 ( + vshlq_n_s64 (ski, 52 - V_POWF_EXP2_TABLE_BITS))); + float64x2_t s = vreinterpretq_f64_u64 (t); + float64x2_t p = vfmaq_f64 (d->exp2_poly[1], r, d->exp2_poly[0]); + p = vfmaq_f64 (d->exp2_poly[2], r, p); + p = vfmaq_f64 (s, p, vmulq_f64 (s, r)); + return p; +} + +static inline float32x4_t +powf_core (const struct data *d, float32x4_t *ylogx, uint32x4_t tmp, + float32x4_t iz, float32x4_t y, int32x4_t k, uint32x4_t sign_bias) +{ + /* Use double precision for each lane: split input vectors into lo and hi + halves and promote. 
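+     sign_bias is widened per lane as well, so each double-precision lane
+     carries its own sign correction into exp2_core_with_sbias.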
*/ + float64x2_t tab0 = log2_lookup (d, vgetq_lane_u32 (tmp, 0)), + tab1 = log2_lookup (d, vgetq_lane_u32 (tmp, 1)), + tab2 = log2_lookup (d, vgetq_lane_u32 (tmp, 2)), + tab3 = log2_lookup (d, vgetq_lane_u32 (tmp, 3)); + + float64x2_t iz_lo = vcvt_f64_f32 (vget_low_f32 (iz)), + iz_hi = vcvt_high_f64_f32 (iz); + + float64x2_t k_lo = vcvtq_f64_s64 (vmovl_s32 (vget_low_s32 (k))), + k_hi = vcvtq_f64_s64 (vmovl_high_s32 (k)); + + float64x2_t invc_lo = vzip1q_f64 (tab0, tab1), + invc_hi = vzip1q_f64 (tab2, tab3), + logc_lo = vzip2q_f64 (tab0, tab1), + logc_hi = vzip2q_f64 (tab2, tab3); + + float64x2_t y_lo = vcvt_f64_f32 (vget_low_f32 (y)), + y_hi = vcvt_high_f64_f32 (y); + + float64x2_t ylogx_lo = ylogx_core (d, iz_lo, k_lo, invc_lo, logc_lo, y_lo); + float64x2_t ylogx_hi = ylogx_core (d, iz_hi, k_hi, invc_hi, logc_hi, y_hi); + + uint64x2_t sign_bias_lo = vmovl_u32 (vget_low_u32 (sign_bias)); + uint64x2_t sign_bias_hi = vmovl_high_u32 (sign_bias); + + float32x2_t p_lo + = vcvt_f32_f64 (exp2_core_with_sbias (d, ylogx_lo, sign_bias_lo)); + float32x2_t p_hi + = vcvt_f32_f64 (exp2_core_with_sbias (d, ylogx_hi, sign_bias_hi)); + + *ylogx = vcombine_f32 (vcvt_f32_f64 (ylogx_lo), vcvt_f32_f64 (ylogx_hi)); + + return vcombine_f32 (p_lo, p_hi); +} + +/* Power implementation without assumptions on x or y. + Evaluate powr(|x|) = exp (y * log(|x|)), and handle + sign of x using the sign bias. + Handle underflow and overflow in exponential. */ +static inline float32x4_t +v_powf_core (float32x4_t x, float32x4_t y, uint32x4_t sign_bias, + const struct data *d) +{ + uint32x4_t ix = vreinterpretq_u32_f32 (x); + + /* Part of core computation carried in working precision. */ + uint32x4_t tmp = vsubq_u32 (ix, d->off); + uint32x4_t top = vbicq_u32 (tmp, v_u32 (MantissaMask)); + float32x4_t iz = vreinterpretq_f32_u32 (vsubq_u32 (ix, top)); + int32x4_t k + = vshrq_n_s32 (vreinterpretq_s32_u32 (top), + 23 - V_POWF_EXP2_TABLE_BITS); /* arithmetic shift. */ + + /* Compute core in extended precision and return intermediate ylogx results + to handle cases of underflow and overflow in exp. */ + float32x4_t ylogx; + float32x4_t ret = powf_core (d, &ylogx, tmp, iz, y, k, sign_bias); + + /* Handle exp special cases of underflow and overflow. */ + uint32x4_t sign = vshlq_n_u32 (sign_bias, 20 - V_POWF_EXP2_TABLE_BITS); + float32x4_t ret_oflow = vreinterpretq_f32_u32 (vorrq_u32 (sign, d->inf)); + float32x4_t ret_uflow = vreinterpretq_f32_u32 (sign); + ret = vbslq_f32 (vcleq_f32 (ylogx, d->uflow_bound), ret_uflow, ret); + ret = vbslq_f32 (vcgtq_f32 (ylogx, d->oflow_bound), ret_oflow, ret); + return ret; +} diff --git a/sysdeps/aarch64/fpu/v_powrf_inline.h b/sysdeps/aarch64/fpu/v_powrf_inline.h new file mode 100644 index 0000000000..51f93691fb --- /dev/null +++ b/sysdeps/aarch64/fpu/v_powrf_inline.h @@ -0,0 +1,246 @@ +/* Helper for AdvSIMD single-precision powr + + Copyright (C) 2025 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. 
+ + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + . */ + +#define Log2IdxMask (V_POWF_LOG2_N - 1) +#define Exp2IdxMask (V_POWF_EXP2_N - 1) +#define Scale ((double) V_POWF_EXP2_N) +#define SignBias (1 << (V_POWF_EXP2_TABLE_BITS + 11)) + +#define MantissaMask 0x007fffff + +static const struct data +{ + uint32x4_t one, special_bound, sign_bias; + float32x4_t norm; + uint32x4_t subnormal_bias; + uint32x4_t off; + float32x4_t uflow_bound, oflow_bound; + uint32x4_t inf; + float32x4_t nan; + struct + { + double invc, logc; + } log2_tab[V_POWF_LOG2_N]; + float64x2_t log2_poly[4]; + uint64_t exp2_tab[V_POWF_EXP2_N]; + float64x2_t exp2_poly[3]; +} data = { + /* Table and polynomial for log2 approximation. */ + .log2_tab = { + {0x1.6489890582816p+0, -0x1.e960f97b22702p-2 * Scale}, + {0x1.5cf19b35e3472p+0, -0x1.c993406cd4db6p-2 * Scale}, + {0x1.55aac0e956d65p+0, -0x1.aa711d9a7d0f3p-2 * Scale}, + {0x1.4eb0022977e01p+0, -0x1.8bf37bacdce9bp-2 * Scale}, + {0x1.47fcccda1dd1fp+0, -0x1.6e13b3519946ep-2 * Scale}, + {0x1.418ceabab68c1p+0, -0x1.50cb8281e4089p-2 * Scale}, + {0x1.3b5c788f1edb3p+0, -0x1.341504a237e2bp-2 * Scale}, + {0x1.3567de48e9c9ap+0, -0x1.17eaab624ffbbp-2 * Scale}, + {0x1.2fabc80fd19bap+0, -0x1.f88e708f8c853p-3 * Scale}, + {0x1.2a25200ce536bp+0, -0x1.c24b6da113914p-3 * Scale}, + {0x1.24d108e0152e3p+0, -0x1.8d02ee397cb1dp-3 * Scale}, + {0x1.1facd8ab2fbe1p+0, -0x1.58ac1223408b3p-3 * Scale}, + {0x1.1ab614a03efdfp+0, -0x1.253e6fd190e89p-3 * Scale}, + {0x1.15ea6d03af9ffp+0, -0x1.e5641882c12ffp-4 * Scale}, + {0x1.1147b994bb776p+0, -0x1.81fea712926f7p-4 * Scale}, + {0x1.0ccbf650593aap+0, -0x1.203e240de64a3p-4 * Scale}, + {0x1.0875408477302p+0, -0x1.8029b86a78281p-5 * Scale}, + {0x1.0441d42a93328p+0, -0x1.85d713190fb9p-6 * Scale}, + {0x1p+0, 0x0p+0 * Scale}, + {0x1.f1d006c855e86p-1, 0x1.4c1cc07312997p-5 * Scale}, + {0x1.e28c3341aa301p-1, 0x1.5e1848ccec948p-4 * Scale}, + {0x1.d4bdf9aa64747p-1, 0x1.04cfcb7f1196fp-3 * Scale}, + {0x1.c7b45a24e5803p-1, 0x1.582813d463c21p-3 * Scale}, + {0x1.bb5f5eb2ed60ap-1, 0x1.a936fa68760ccp-3 * Scale}, + {0x1.afb0bff8fe6b4p-1, 0x1.f81bc31d6cc4ep-3 * Scale}, + {0x1.a49badf7ab1f5p-1, 0x1.2279a09fae6b1p-2 * Scale}, + {0x1.9a14a111fc4c9p-1, 0x1.47ec0b6df5526p-2 * Scale}, + {0x1.901131f5b2fdcp-1, 0x1.6c71762280f1p-2 * Scale}, + {0x1.8687f73f6d865p-1, 0x1.90155070798dap-2 * Scale}, + {0x1.7d7067eb77986p-1, 0x1.b2e23b1d3068cp-2 * Scale}, + {0x1.74c2c1cf97b65p-1, 0x1.d4e21b0daa86ap-2 * Scale}, + {0x1.6c77f37cff2a1p-1, 0x1.f61e2a2f67f3fp-2 * Scale}, + }, + .log2_poly = { /* rel err: 1.5 * 2^-30. */ + V2 (-0x1.6ff5daa3b3d7cp-2 * Scale), + V2 (0x1.ec81d03c01aebp-2 * Scale), + V2 (-0x1.71547bb43f101p-1 * Scale), + V2 (0x1.7154764a815cbp0 * Scale) + }, + /* Table and polynomial for exp2 approximation. 
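+     Each exp2_tab entry holds the bits of 2^(i/32) (V_POWF_EXP2_N = 32)
+     with i << 47 pre-subtracted, so exp2_core can simply add
+     ki << (52 - V_POWF_EXP2_TABLE_BITS): that restores the fractional index
+     and folds the integer part of k/N into the exponent field.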
*/ + .exp2_tab = { + 0x3ff0000000000000, 0x3fefd9b0d3158574, 0x3fefb5586cf9890f, + 0x3fef9301d0125b51, 0x3fef72b83c7d517b, 0x3fef54873168b9aa, + 0x3fef387a6e756238, 0x3fef1e9df51fdee1, 0x3fef06fe0a31b715, + 0x3feef1a7373aa9cb, 0x3feedea64c123422, 0x3feece086061892d, + 0x3feebfdad5362a27, 0x3feeb42b569d4f82, 0x3feeab07dd485429, + 0x3feea47eb03a5585, 0x3feea09e667f3bcd, 0x3fee9f75e8ec5f74, + 0x3feea11473eb0187, 0x3feea589994cce13, 0x3feeace5422aa0db, + 0x3feeb737b0cdc5e5, 0x3feec49182a3f090, 0x3feed503b23e255d, + 0x3feee89f995ad3ad, 0x3feeff76f2fb5e47, 0x3fef199bdd85529c, + 0x3fef3720dcef9069, 0x3fef5818dcfba487, 0x3fef7c97337b9b5f, + 0x3fefa4afa2a490da, 0x3fefd0765b6e4540, + }, + .exp2_poly = { /* rel err: 1.69 * 2^-34. */ + V2 (0x1.c6af84b912394p-5 / Scale / Scale / Scale), + V2 (0x1.ebfce50fac4f3p-3 / Scale / Scale), + V2 (0x1.62e42ff0c52d6p-1 / Scale), + }, + .one = V4 (1), + .special_bound = V4 (2u * 0x7f800000 - 1), + .norm = V4 (0x1p23f), + .subnormal_bias = V4 (0x0b800000), /* 23 << 23. */ + .off = V4 (0x3f35d000), + .sign_bias = V4 (SignBias), + .inf = V4 (0x7f800000), + .nan = V4 (__builtin_nanf ("")), + /* 2.6 ulp ~ 0.5 + 2^24 (128*Ln2*relerr_log2 + relerr_exp2). */ + .uflow_bound = V4 (-0x1.2cp+12f), /* -150.0 * V_POWF_EXP2_N. */ + .oflow_bound = V4 (0x1p+12f), /* 128.0 * V_POWF_EXP2_N. */ +}; + +/* Check if zero, inf or nan. */ +static inline uint32x4_t +v_zeroinfnan (const struct data *d, uint32x4_t i) +{ + return vcgeq_u32 (vsubq_u32 (vaddq_u32 (i, i), d->one), d->special_bound); +} + +/* Check if zero, inf or nan. */ +static inline int +zeroinfnan (uint32_t ix) +{ + return 2 * ix - 1 >= 2u * 0x7f800000 - 1; +} + +static inline float64x2_t +ylogx_core (const struct data *d, float64x2_t iz, float64x2_t k, + float64x2_t invc, float64x2_t logc, float64x2_t y) +{ + + /* log2(x) = log1p(z/c-1)/ln2 + log2(c) + k. */ + float64x2_t r = vfmaq_f64 (v_f64 (-1.0), iz, invc); + float64x2_t y0 = vaddq_f64 (logc, k); + + /* Polynomial to approximate log1p(r)/ln2. */ + float64x2_t logx = vfmaq_f64 (d->log2_poly[1], r, d->log2_poly[0]); + logx = vfmaq_f64 (d->log2_poly[2], logx, r); + logx = vfmaq_f64 (d->log2_poly[3], logx, r); + logx = vfmaq_f64 (y0, logx, r); + + return vmulq_f64 (logx, y); +} + +static inline float64x2_t +log2_lookup (const struct data *d, uint32_t i) +{ + return vld1q_f64 ( + &d->log2_tab[(i >> (23 - V_POWF_LOG2_TABLE_BITS)) & Log2IdxMask].invc); +} + +static inline uint64x1_t +exp2_lookup (const struct data *d, uint64_t i) +{ + return vld1_u64 (&d->exp2_tab[i & Exp2IdxMask]); +} + +static inline float64x2_t +exp2_core (const struct data *d, float64x2_t ylogx) +{ + /* N*x = k + r with r in [-1/2, 1/2]. */ + float64x2_t kd = vrndnq_f64 (ylogx); + int64x2_t ki = vcvtaq_s64_f64 (ylogx); + float64x2_t r = vsubq_f64 (ylogx, kd); + + /* exp2(x) = 2^(k/N) * 2^r ~= s * (C0*r^3 + C1*r^2 + C2*r + 1). */ + uint64x2_t t = vcombine_u64 (exp2_lookup (d, vgetq_lane_s64 (ki, 0)), + exp2_lookup (d, vgetq_lane_s64 (ki, 1))); + t = vaddq_u64 (t, vreinterpretq_u64_s64 ( + vshlq_n_s64 (ki, 52 - V_POWF_EXP2_TABLE_BITS))); + float64x2_t s = vreinterpretq_f64_u64 (t); + float64x2_t p = vfmaq_f64 (d->exp2_poly[1], r, d->exp2_poly[0]); + p = vfmaq_f64 (d->exp2_poly[2], r, p); + p = vfmaq_f64 (s, p, vmulq_f64 (s, r)); + return p; +} + +static inline float32x4_t +powrf_core (const struct data *d, float32x4_t *ylogx, uint32x4_t tmp, + float32x4_t iz, float32x4_t y, int32x4_t k) +{ + /* Use double precision for each lane: split input vectors into lo and hi + halves and promote. 
+  float64x2_t tab0 = log2_lookup (d, vgetq_lane_u32 (tmp, 0)),
+	      tab1 = log2_lookup (d, vgetq_lane_u32 (tmp, 1)),
+	      tab2 = log2_lookup (d, vgetq_lane_u32 (tmp, 2)),
+	      tab3 = log2_lookup (d, vgetq_lane_u32 (tmp, 3));
+
+  float64x2_t iz_lo = vcvt_f64_f32 (vget_low_f32 (iz)),
+	      iz_hi = vcvt_high_f64_f32 (iz);
+
+  float64x2_t k_lo = vcvtq_f64_s64 (vmovl_s32 (vget_low_s32 (k))),
+	      k_hi = vcvtq_f64_s64 (vmovl_high_s32 (k));
+
+  float64x2_t invc_lo = vzip1q_f64 (tab0, tab1),
+	      invc_hi = vzip1q_f64 (tab2, tab3),
+	      logc_lo = vzip2q_f64 (tab0, tab1),
+	      logc_hi = vzip2q_f64 (tab2, tab3);
+
+  float64x2_t y_lo = vcvt_f64_f32 (vget_low_f32 (y)),
+	      y_hi = vcvt_high_f64_f32 (y);
+
+  float64x2_t ylogx_lo = ylogx_core (d, iz_lo, k_lo, invc_lo, logc_lo, y_lo);
+  float64x2_t ylogx_hi = ylogx_core (d, iz_hi, k_hi, invc_hi, logc_hi, y_hi);
+
+  float32x2_t p_lo = vcvt_f32_f64 (exp2_core (d, ylogx_lo));
+  float32x2_t p_hi = vcvt_f32_f64 (exp2_core (d, ylogx_hi));
+
+  *ylogx = vcombine_f32 (vcvt_f32_f64 (ylogx_lo), vcvt_f32_f64 (ylogx_hi));
+
+  return vcombine_f32 (p_lo, p_hi);
+}
+
+/* Powr core for positive, normal x: evaluate powr(x, y) = exp (y * log(x))
+   directly, with no sign-bias handling needed.
+   Handle underflow and overflow in the exponential.  */
+static inline float32x4_t
+v_powrf_core (float32x4_t x, float32x4_t y, const struct data *d)
+{
+  uint32x4_t ix = vreinterpretq_u32_f32 (x);
+
+  /* Part of core computation carried in working precision.  */
+  uint32x4_t tmp = vsubq_u32 (ix, d->off);
+  uint32x4_t top = vbicq_u32 (tmp, v_u32 (MantissaMask));
+  float32x4_t iz = vreinterpretq_f32_u32 (vsubq_u32 (ix, top));
+  int32x4_t k
+      = vshrq_n_s32 (vreinterpretq_s32_u32 (top),
+		     23 - V_POWF_EXP2_TABLE_BITS); /* arithmetic shift.  */
+
+  /* Compute core in extended precision and return intermediate ylogx results
+     to handle cases of underflow and overflow in exp.  */
+  float32x4_t ylogx;
+  float32x4_t ret = powrf_core (d, &ylogx, tmp, iz, y, k);
+
+  /* Handle exp special cases of underflow and overflow.  */
+  float32x4_t ret_oflow = vreinterpretq_f32_u32 (d->inf);
+  float32x4_t ret_uflow = v_f32 (0);
+  ret = vbslq_f32 (vcleq_f32 (ylogx, d->uflow_bound), ret_uflow, ret);
+  ret = vbslq_f32 (vcgtq_f32 (ylogx, d->oflow_bound), ret_oflow, ret);
+  return ret;
+}