From patchwork Thu May 14 12:31:57 2026 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Yanan Zhou X-Patchwork-Id: 134962 Return-Path: X-Original-To: patchwork@sourceware.org Delivered-To: patchwork@sourceware.org Received: from vm01.sourceware.org (localhost [IPv6:::1]) by sourceware.org (Postfix) with ESMTP id 668304B196E3 for ; Thu, 14 May 2026 12:38:09 +0000 (GMT) DKIM-Filter: OpenDKIM Filter v2.11.0 sourceware.org 668304B196E3 X-Original-To: libc-alpha@sourceware.org Delivered-To: libc-alpha@sourceware.org Received: from mxhk.zte.com.cn (mxhk.zte.com.cn [160.30.148.34]) by sourceware.org (Postfix) with ESMTPS id 7ED854B920AF for ; Thu, 14 May 2026 12:37:38 +0000 (GMT) DMARC-Filter: OpenDMARC Filter v1.4.2 sourceware.org 7ED854B920AF Authentication-Results: sourceware.org; dmarc=pass (p=none dis=none) header.from=zte.com.cn Authentication-Results: sourceware.org; spf=pass smtp.mailfrom=zte.com.cn ARC-Filter: OpenARC Filter v1.0.0 sourceware.org 7ED854B920AF Authentication-Results: sourceware.org; arc=none smtp.remote-ip=160.30.148.34 ARC-Seal: i=1; a=rsa-sha256; d=sourceware.org; s=key; t=1778762259; cv=none; b=egFv8+jYE2OCBvEc9PvQnIOO/s5Tbd5qf3Fup/Pbmyw3SgBAR7VeC5Yp3O36VY0UQgQU0qhw4MpI2xWK6WeYhNflJHwbwoVMv03xDLKdWo2DGsmTo++THfD7TxB5y0IjLYRPmGgRg+7Gga8dgkRZmN0gRTpra4KT+U+0zDZnF34= ARC-Message-Signature: i=1; a=rsa-sha256; d=sourceware.org; s=key; t=1778762259; c=relaxed/simple; bh=1Xaf31EZFfMFc8QhGgi9pwetAkDHoa1W6RlLGFlj2xU=; h=Message-Id:Date:Mime-Version:From:To:Subject; b=ZhA6s7gxMpIxm3f5EqRpOYMBSbK8P7XhNaWcX8F63pKdx83OydqcO29KHpVOOcf3sTCulOKR5N5OsJUjUHYBEqPX5mUoJWr6gxI+WtoQeeVBKvcbJW/NPpEq3/owH3LN2z/nLvxDBcl8ZzvnzTWtfi0MSjpPP5WVr9JXNLjpqpE= ARC-Authentication-Results: i=1; sourceware.org DKIM-Filter: OpenDKIM Filter v2.11.0 sourceware.org 7ED854B920AF Received: from mse-db.zte.com.cn (unknown [10.5.228.131]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange x25519 server-signature RSA-PSS (2048 bits) server-digest SHA256) (No client certificate requested) by mxhk.zte.com.cn (FangMail) with ESMTPS id 4gGVGN3dX1z5B100; Thu, 14 May 2026 20:37:32 +0800 (CST) Received: (from root@localhost) by mse-db.zte.com.cn id 64ECbTin003105; Thu, 14 May 2026 20:37:29 +0800 (+08) (envelope-from zhou.yanan@zte.com.cn) Message-Id: <202605141237.64ECbTin003105@mse-db.zte.com.cn> Received: from szxlzmapp01.zte.com.cn ([10.5.231.85]) by mse-fl2.zte.com.cn with SMTP id 64ECVsot056392; Thu, 14 May 2026 20:31:54 +0800 (+08) (envelope-from zhou.yanan@zte.com.cn) Received: from mapi (szxlzmapp04[null]) by mapi (Zmail) with MAPI id mid18; Thu, 14 May 2026 20:31:57 +0800 (CST) X-Zmail-TransId: 2b066a05c0bde45-52646 X-Mailer: Zmail v1.0 Date: Thu, 14 May 2026 20:31:57 +0800 (CST) Mime-Version: 1.0 From: To: Cc: , , , , , , , , , , , , Subject: =?utf-8?q?=5BRFC_PATCH_2/2=5D_RISC-V=3A_Improve_RVV_libmvec_double-?= =?utf-8?q?precision_exp_performance?= X-MAIL: mse-db.zte.com.cn 64ECbTin003105 X-MSS: AUDITRELEASE@mse-db.zte.com.cn X-TLS: YES X-SPF-DOMAIN: zte.com.cn X-ENVELOPE-SENDER: zhou.yanan@zte.com.cn X-SPF: None X-SOURCE-IP: 10.5.228.131 unknown Thu, 14 May 2026 20:37:32 +0800 X-Fangmail-Anti-Spam-Filtered: true X-Fangmail-MID-QID: 6A05C20C.000/4gGVGN3dX1z5B100 X-Spam-Status: No, score=-12.1 required=5.0 tests=BAYES_00, GIT_PATCH_0, KAM_DMARC_STATUS, MSGID_FROM_MTA_HEADER, SPF_HELO_NONE, SPF_PASS, TXREP, UNPARSEABLE_RELAY shortcircuit=no autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on sourceware.org X-BeenThere: libc-alpha@sourceware.org X-Mailman-Version: 2.1.30 Precedence: list List-Id: Libc-alpha mailing list List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: libc-alpha-bounces~patchwork=sourceware.org@sourceware.org In-Reply-To: References: This is the second patch in the series, applying on top of [PATCH 1/2]. It replaces the exp() in sysdeps/riscv/rvd/v_d_exp.c with an algorithm adapted from the ARM Optimized Routines, improving performance. Measured on K1 (VLEN=256, exp() loop, N=1M, REPEAT=1000), compiled with -Ofast -march=rv64gcv_zvl256b -fopenmp -static -lmvec -lm : My implementation: 9.92 s (1.79x faster than baseline) Rivos v1: 17.79 s (baseline) Rivos v2: 23.47 s Rivos v3: 23.05 s Rivos v4: 16.14 s --- sysdeps/riscv/rvd/v_d_exp.c | 168 ++++++++++++++++++++---------------- 1 file changed, 95 insertions(+), 73 deletions(-) diff --git a/sysdeps/riscv/rvd/v_d_exp.c b/sysdeps/riscv/rvd/v_d_exp.c index 2575e454..8b20fefe 100644 --- a/sysdeps/riscv/rvd/v_d_exp.c +++ b/sysdeps/riscv/rvd/v_d_exp.c @@ -28,94 +28,116 @@ #define COMPILE_FOR_EXP -#define EXCEPTION_HANDLING_EXP(vx, special_args, vy_special, vlen) \ - do \ - { \ - VUINT vclass = __riscv_vfclass ((vx), (vlen)); \ - IDENTIFY (vclass, class_NaN | class_Inf, (special_args), (vlen)); \ - UINT nb_special_args = __riscv_vcpop ((special_args), (vlen)); \ - if (nb_special_args > 0) \ - { \ - /* Substitute -Inf with +0 */ \ - VBOOL id_mask; \ - IDENTIFY (vclass, class_negInf, id_mask, (vlen)); \ - vx = __riscv_vfmerge (vx, fp_posZero, id_mask, (vlen)); \ - vy_special = __riscv_vfadd ((special_args), (vx), (vx), (vlen)); \ - vx = __riscv_vfmerge ((vx), fp_posZero, (special_args), (vlen)); \ - } \ - } \ - while (0) - -#define P_INV_STD 0x1.71547652b82fep+0 -#define P_HI_STD 0x1.62e42fefa39efp-1 -#define P_LO_STD 0x1.abc9e3b39803fp-56 -#define P_INV_TBL 0x1.71547652b82fep+6 -#define P_HI_TBL 0x1.62e42fefa39efp-7 -#define P_LO_TBL 0x1.abc9e3b39803fp-62 -#define X_MAX 0x1.65p+9 -#define X_MIN -0x1.77p+9 +#define EXP_SCALE 7 +#define EXP_TABLE_SIZE 128 +#define EXP_MASK 0x7F + +static const double inv_ln2 = 0x1.71547652b82fep7; +static const double ln2_hi = 0x1.62e42fefa39efp-8; +static const double ln2_lo = 0x1.abc9e3b39803f3p-63; +static const double shift = 0x1.8p+52; + +static const double a0 = 0x1.ffffffffffd43p-2; +static const double a1 = 0x1.55555c75adbb2p-3; +static const double a2 = 0x1.55555da646206p-5; + +static const uint64_t exp_tab_64f[EXP_TABLE_SIZE] = { + 0x3ff0000000000000, 0x3feff63da9fb3335, 0x3fefec9a3e778061, + 0x3fefe315e86e7f85, 0x3fefd9b0d3158574, 0x3fefd06b29ddf6de, + 0x3fefc74518759bc8, 0x3fefbe3ecac6f383, 0x3fefb5586cf9890f, + 0x3fefac922b7247f7, 0x3fefa3ec32d3d1a2, 0x3fef9b66affed31b, + 0x3fef9301d0125b51, 0x3fef8abdc06c31cc, 0x3fef829aaea92de0, + 0x3fef7a98c8a58e51, 0x3fef72b83c7d517b, 0x3fef6af9388c8dea, + 0x3fef635beb6fcb75, 0x3fef5be084045cd4, 0x3fef54873168b9aa, + 0x3fef4d5022fcd91d, 0x3fef463b88628cd6, 0x3fef3f49917ddc96, + 0x3fef387a6e756238, 0x3fef31ce4fb2a63f, 0x3fef2b4565e27cdd, + 0x3fef24dfe1f56381, 0x3fef1e9df51fdee1, 0x3fef187fd0dad990, + 0x3fef1285a6e4030b, 0x3fef0cafa93e2f56, 0x3fef06fe0a31b715, + 0x3fef0170fc4cd831, 0x3feefc08b26416ff, 0x3feef6c55f929ff1, + 0x3feef1a7373aa9cb, 0x3feeecae6d05d866, 0x3feee7db34e59ff7, + 0x3feee32dc313a8e5, 0x3feedea64c123422, 0x3feeda4504ac801c, + 0x3feed60a21f72e2a, 0x3feed1f5d950a897, 0x3feece086061892d, + 0x3feeca41ed1d0057, 0x3feec6a2b5c13cd0, 0x3feec32af0d7d3de, + 0x3feebfdad5362a27, 0x3feebcb299fddd0d, 0x3feeb9b2769d2ca7, + 0x3feeb6daa2cf6642, 0x3feeb42b569d4f82, 0x3feeb1a4ca5d920f, + 0x3feeaf4736b527da, 0x3feead12d497c7fd, 0x3feeab07dd485429, + 0x3feea9268a5946b7, 0x3feea76f15ad2148, 0x3feea5e1b976dc09, + 0x3feea47eb03a5585, 0x3feea34634ccc320, 0x3feea23882552225, + 0x3feea155d44ca973, 0x3feea09e667f3bcd, 0x3feea012750bdabf, + 0x3fee9fb23c651a2f, 0x3fee9f7df9519484, 0x3fee9f75e8ec5f74, + 0x3fee9f9a48a58174, 0x3fee9feb564267c9, 0x3feea0694fde5d3f, + 0x3feea11473eb0187, 0x3feea1ed0130c132, 0x3feea2f336cf4e62, + 0x3feea427543e1a12, 0x3feea589994cce13, 0x3feea71a4623c7ad, + 0x3feea8d99b4492ed, 0x3feeaac7d98a6699, 0x3feeace5422aa0db, + 0x3feeaf3216b5448c, 0x3feeb1ae99157736, 0x3feeb45b0b91ffc6, + 0x3feeb737b0cdc5e5, 0x3feeba44cbc8520f, 0x3feebd829fde4e50, + 0x3feec0f170ca07ba, 0x3feec49182a3f090, 0x3feec86319e32323, + 0x3feecc667b5de565, 0x3feed09bec4a2d33, 0x3feed503b23e255d, + 0x3feed99e1330b358, 0x3feede6b5579fdbf, 0x3feee36bbfd3f37a, + 0x3feee89f995ad3ad, 0x3feeee07298db666, 0x3feef3a2b84f15fb, + 0x3feef9728de5593a, 0x3feeff76f2fb5e47, 0x3fef05b030a1064a, + 0x3fef0c1e904bc1d2, 0x3fef12c25bd71e09, 0x3fef199bdd85529c, + 0x3fef20ab5fffd07a, 0x3fef27f12e57d14b, 0x3fef2f6d9406e7b5, + 0x3fef3720dcef9069, 0x3fef3f0b555dc3fa, 0x3fef472d4a07897c, + 0x3fef4f87080d89f2, 0x3fef5818dcfba487, 0x3fef60e316c98398, + 0x3fef69e603db3285, 0x3fef7321f301b460, 0x3fef7c97337b9b5f, + 0x3fef864614f5a129, 0x3fef902ee78b3ff6, 0x3fef9a51fbc74c83, + 0x3fefa4afa2a490da, 0x3fefaf482d8e67f1, 0x3fefba1bee615a27, + 0x3fefc52b376bba97, 0x3fefd0765b6e4540, 0x3fefdbfdad9cbe14, + 0x3fefe7c1819e90d8, 0x3feff3c22b8f71f1, +}; #define V_NAME_FUNCTION(lmul, simdlen) \ VFLOAT V_NAME_D1 (lmul, simdlen, exp) (VFLOAT x) \ { \ - size_t vl; \ - VFLOAT vx, vy, vy_special; \ - VBOOL special_args; \ - \ - SET_ROUNDTONEAREST; \ - vl = VSET (simdlen); \ - vx = x; \ - /* Set results for input of NaN and Inf; substitute them with zero */ \ - EXCEPTION_HANDLING_EXP (vx, special_args, vy_special, vl); \ - \ - /* Clip */ \ - vx = FCLIP (vx, X_MIN, X_MAX, vl); \ - \ - /* Argument reduction */ \ - VFLOAT flt_n = __riscv_vfmul (vx, P_INV_STD, vl); \ - VINT n = __riscv_vfcvt_x (flt_n, vl); \ - flt_n = __riscv_vfcvt_f (n, vl); \ - VFLOAT r = __riscv_vfnmsac (vx, P_HI_STD, flt_n, vl); \ - \ - r = __riscv_vfnmsac (r, P_LO_STD, flt_n, vl); \ + size_t vl = VSET (simdlen); \ + VFLOAT x_abs = __riscv_vfabs (x, vl); \ + VBOOL mask = __riscv_vmfgt (x_abs, 708, vl); \ \ - /* Polynomial computation, we have a degree 11 \ - We compute the part from r^3 in three segments, increasing parallelism \ - Ideally the compiler will interleave the computations of the segments \ - */ \ - VFLOAT poly_right = PSTEP ( \ - 0x1.71df804f1baa1p-19, r, \ - PSTEP (0x1.28aa3ea739296p-22, 0x1.acf86201fd199p-26, r, vl), vl); \ + VFLOAT vz = __riscv_vfadd (__riscv_vfmul (x, inv_ln2, vl), shift, vl); \ + VUINT vu = F_AS_U (vz); \ + VFLOAT vn = __riscv_vfsub (vz, shift, vl); \ \ - VFLOAT poly_mid = PSTEP ( \ - 0x1.6c16c1825c970p-10, r, \ - PSTEP (0x1.a01a00fe6f730p-13, 0x1.a0199e1789c72p-16, r, vl), vl); \ + VFLOAT r = __riscv_vfnmsub (vn, ln2_hi, x, vl); \ + r = __riscv_vfnmsub (vn, ln2_lo, r, vl); \ \ - VFLOAT poly_left = PSTEP ( \ - 0x1.55555555554d2p-3, r, \ - PSTEP (0x1.5555555551307p-5, 0x1.11111111309a4p-7, r, vl), vl); \ + VFLOAT r2 = __riscv_vfmul (r, r, vl); \ + VFLOAT y = __riscv_vfadd (__riscv_vfmul (r, a1, vl), a0, vl); \ + y = __riscv_vfmadd (r2, a2, y, vl); \ + y = __riscv_vfmadd (y, r2, r, vl); \ \ - VFLOAT r_sq = __riscv_vfmul (r, r, vl); \ - VFLOAT r_cube = __riscv_vfmul (r_sq, r, vl); \ + VUINT idx = __riscv_vsll (__riscv_vand (vu, EXP_MASK, vl), 3, vl); \ + VUINT e = __riscv_vsll (vu, 45, vl); \ + VUINT tbl = __riscv_vloxei64 (exp_tab_64f, idx, vl); \ + VFLOAT s = U_AS_F (__riscv_vadd (tbl, e, vl)); \ \ - VFLOAT poly = __riscv_vfmadd (poly_right, r_cube, poly_mid, vl); \ - poly = __riscv_vfmadd (poly, r_cube, poly_left, vl); \ + VFLOAT ret = __riscv_vfmadd (s, y, s, vl); \ \ - poly = PSTEP (0x1.0000000000007p-1, r, poly, vl); \ + if (__riscv_vcpop (mask, vl) > 0) \ + { \ + VBOOL n_neg = __riscv_vmflt (vn, 0.0, vl); \ + VUINT b = __riscv_vmerge (VMVU_VX (0, vl), \ + 0x6000000000000000ULL, n_neg, vl); \ \ - r = __riscv_vfmacc (r, r_sq, poly, vl); \ - vy = __riscv_vfadd (r, 0x1.0p0, vl); \ + VUINT u_s1 = __riscv_vrsub (b, 0x7000000000000000ULL, vl); \ + VFLOAT vs1 = U_AS_F (u_s1); \ \ - /* at this point, vy is the entire degree-11 polynomial vy ~=~ exp(r) */ \ + VUINT u_s_raw = F_AS_U (s); \ + VUINT u_s2 = __riscv_vsub (u_s_raw, 0x3010000000000000ULL, vl); \ + u_s2 = __riscv_vadd (u_s2, b, vl); \ + VFLOAT vs2 = U_AS_F (u_s2); \ \ - /* Need to compute 2^n * exp(r).*/ \ - FAST_LDEXP (vy, n, vl); \ + VFLOAT vr0 = __riscv_vfmadd (vs2, y, vs2, vl); \ + vr0 = __riscv_vfmul (vr0, vs1, vl); \ \ - /* Incorporate results of exceptional inputs */ \ - vy = __riscv_vmerge (vy, vy_special, special_args, vl); \ + VFLOAT n_abs = __riscv_vfabs (vn, vl); \ + VBOOL p_cmp = __riscv_vmfgt (n_abs, 163840.0, vl); \ + VFLOAT vr1 = __riscv_vfmul (vs1, vs1, vl); \ \ - RESTORE_FRM; \ - return vy; \ + VFLOAT ret_special = __riscv_vmerge (vr0, vr1, p_cmp, vl); \ + ret = __riscv_vmerge (ret, ret_special, mask, vl); \ + } \ + return ret; \ } #undef LMUL