From patchwork Sat Jun 6 18:36:23 2026
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Fabian Rast
X-Patchwork-Id: 136633
Date: Sat, 06 Jun 2026 20:36:23 +0200
Subject: [PATCH v2] rtld: cache cpuid results on the stack for intel
To: "Sunil Pandey"
From: "Fabian Rast"
X-Mailer: aerc 0.21.0-0-g5549850facc2
List-Id: Libc-alpha mailing list

dl_init_cacheinfo retrieves various information about cache sizes,
using the cpuid instruction on x86. Previously, the same cpuid leaves
were queried multiple times. This behavior caused intel_check_word to
prominently show up in profiles of dynamic loader startup on the
Intel(R) Xeon(R) Gold 6430. The big performance impact could not be
reproduced on other Intel cpus.

This patch reduces the number of cpuid queries on startup by caching
their results on the stack for reuse when searching for a different
cache size value. This approach does not change the overall design of
the cache enumeration code (repeated calls to handle_* functions).

The values are cached on the stack instead of globally (e.g. in the
cpu_features global) because they are never needed after early
initialization.

The cache is only active for Intel cpus, because it has not yet been
shown through benchmarks that it meaningfully improves performance for
other processors.

Signed-off-by: Fabian Rast
Reviewed-by: Sunil K Pandey
---
 sysdeps/x86/dl-cacheinfo.h | 120 ++++++++++++++++++++++---------------
 1 file changed, 73 insertions(+), 47 deletions(-)

diff --git a/sysdeps/x86/dl-cacheinfo.h b/sysdeps/x86/dl-cacheinfo.h
index 84817a84fd..53c9102bca 100644
--- a/sysdeps/x86/dl-cacheinfo.h
+++ b/sysdeps/x86/dl-cacheinfo.h
@@ -108,6 +108,14 @@ static const struct intel_02_cache_info
 
 #define nintel_02_known (sizeof (intel_02_known) / sizeof (intel_02_known [0]))
 
+/* Cache for redundant cpuid queries in handle_intel, intel_check_word and
+   get_common_cache_info. Currently, this has only been shown to significantly
+   improve performance on a specific Intel CPU (Xeon 6430). */
+struct intel_cpuid_cache {
+  char leaf2_valid, leaf4_valid; /* Number of cached (sub)leaves. */
+  unsigned int leaf2[4], leaf4[0x10][4];
+};
+
 static int
 intel_02_known_compare (const void *p1, const void *p2)
 {
@@ -128,7 +136,8 @@ static long int
 __attribute__ ((noinline))
 intel_check_word (int name, unsigned int value, bool *has_level_2,
                   bool *no_level_2_or_3,
-                  const struct cpu_features *cpu_features)
+                  const struct cpu_features *cpu_features,
+                  struct intel_cpuid_cache *cache)
 {
   if ((value & 0x80000000) != 0)
     /* The register value is reserved. */
@@ -162,7 +171,21 @@ intel_check_word (int name, unsigned int value, bool *has_level_2,
           unsigned int round = 0;
           while (1)
             {
-              __cpuid_count (4, round, eax, ebx, ecx, edx);
+              if (round < cache->leaf4_valid)
+                /* Subleaf was queried before. Do not execute cpuid again. */
+                eax = cache->leaf4[round][0], ebx = cache->leaf4[round][1],
+                ecx = cache->leaf4[round][2], edx = cache->leaf4[round][3];
+              else if (round == cache->leaf4_valid
+                       && round < sizeof(cache->leaf4)/sizeof(*cache->leaf4))
+                {
+                  /* Cache the cpuid result if we have space. */
+                  __cpuid_count (4, round, eax, ebx, ecx, edx);
+                  cache->leaf4[round][0] = eax, cache->leaf4[round][1] = ebx;
+                  cache->leaf4[round][2] = ecx, cache->leaf4[round][3] = edx;
+                  cache->leaf4_valid++;
+                }
+              else
+                __cpuid_count (4, round, eax, ebx, ecx, edx);
 
               enum { null = 0, data = 1, inst = 2, uni = 3 } type = eax & 0x1f;
               if (type == null)
@@ -258,7 +281,8 @@ intel_check_word (int name, unsigned int value, bool *has_level_2,
 
 
 static long int __attribute__ ((noinline))
-handle_intel (int name, const struct cpu_features *cpu_features)
+handle_intel (int name, const struct cpu_features *cpu_features,
+              struct intel_cpuid_cache *cache)
 {
   unsigned int maxidx = cpu_features->basic.max_cpuid;
 
@@ -271,41 +295,33 @@ handle_intel (int name, const struct cpu_features *cpu_features)
   long int result = 0;
   bool no_level_2_or_3 = false;
   bool has_level_2 = false;
-  unsigned int eax;
-  unsigned int ebx;
-  unsigned int ecx;
-  unsigned int edx;
-  __cpuid (2, eax, ebx, ecx, edx);
+  int i;
+
+  if (!cache->leaf2_valid)
+    {
+      __cpuid (2, cache->leaf2[0], cache->leaf2[1],
+               cache->leaf2[2], cache->leaf2[3]);
+      cache->leaf2_valid = 1;
+    }
 
   /* The low byte of EAX of CPUID leaf 2 should always return 1 and it
      should be ignored. If it isn't 1, use CPUID leaf 4 instead. */
-  if ((eax & 0xff) != 1)
+  if ((cache->leaf2[0] & 0xff) != 1)
     return intel_check_word (name, 0xff, &has_level_2, &no_level_2_or_3,
-                             cpu_features);
-  else
-    {
-      eax &= 0xffffff00;
+                             cpu_features, cache);
 
-      /* Process the individual registers' value. */
-      result = intel_check_word (name, eax, &has_level_2,
-                                 &no_level_2_or_3, cpu_features);
-      if (result != 0)
-        return result;
-
-      result = intel_check_word (name, ebx, &has_level_2,
-                                 &no_level_2_or_3, cpu_features);
-      if (result != 0)
-        return result;
+  /* Process all descriptors in leaf 2. */
+  result = intel_check_word (name, cache->leaf2[0]&0xffffff00, &has_level_2,
+                             &no_level_2_or_3, cpu_features, cache);
+  if (result != 0)
+    return result;
 
-      result = intel_check_word (name, ecx, &has_level_2,
-                                 &no_level_2_or_3, cpu_features);
-      if (result != 0)
-        return result;
-
-      result = intel_check_word (name, edx, &has_level_2,
-                                 &no_level_2_or_3, cpu_features);
+  for (i = 1; i < 4; i++)
+    {
+      result = intel_check_word (name, cache->leaf2[i], &has_level_2,
+                                 &no_level_2_or_3, cpu_features, cache);
       if (result != 0)
-	return result;
+        return result;
     }
 
   if (name >= _SC_LEVEL2_CACHE_SIZE && name <= _SC_LEVEL3_CACHE_LINESIZE
@@ -779,7 +795,7 @@ handle_hygon (int name)
 
 static void
 get_common_cache_info (long int *shared_ptr, long int * shared_per_thread_ptr, unsigned int *threads_ptr,
-                long int core)
+                long int core, struct intel_cpuid_cache *cache)
 {
   unsigned int eax;
   unsigned int ebx;
@@ -837,7 +853,14 @@ get_common_cache_info (long int *shared_ptr, long int * shared_per_thread_ptr, u
           int check = 0x1 | (threads_l3 == 0) << 1;
           do
             {
-              __cpuid_count (4, i++, eax, ebx, ecx, edx);
+              if (cache && i < cache->leaf4_valid)
+                eax = cache->leaf4[i][0], ebx = cache->leaf4[i][1],
+                ecx = cache->leaf4[i][2], edx = cache->leaf4[i][3];
+              else
+                /* Do not attempt to cache queries at this point,
+                   because get_common_cache_info is called last. */
+                __cpuid_count (4, i, eax, ebx, ecx, edx);
+              i++;
 
               /* There seems to be a bug in at least some Pentium Ds
                  which sometimes fail to iterate all cache parameters.
@@ -1017,35 +1040,38 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
 
   if (cpu_features->basic.kind == arch_kind_intel)
     {
-      data = handle_intel (_SC_LEVEL1_DCACHE_SIZE, cpu_features);
-      shared = handle_intel (_SC_LEVEL3_CACHE_SIZE, cpu_features);
+      struct intel_cpuid_cache cache;
+      cache.leaf2_valid = cache.leaf4_valid = 0;
+
+      data = handle_intel (_SC_LEVEL1_DCACHE_SIZE, cpu_features, &cache);
+      shared = handle_intel (_SC_LEVEL3_CACHE_SIZE, cpu_features, &cache);
       shared_per_thread = shared;
 
       level1_icache_size
-        = handle_intel (_SC_LEVEL1_ICACHE_SIZE, cpu_features);
+        = handle_intel (_SC_LEVEL1_ICACHE_SIZE, cpu_features, &cache);
       level1_icache_linesize
-        = handle_intel (_SC_LEVEL1_ICACHE_LINESIZE, cpu_features);
+        = handle_intel (_SC_LEVEL1_ICACHE_LINESIZE, cpu_features, &cache);
       level1_dcache_size = data;
       level1_dcache_assoc
-        = handle_intel (_SC_LEVEL1_DCACHE_ASSOC, cpu_features);
+        = handle_intel (_SC_LEVEL1_DCACHE_ASSOC, cpu_features, &cache);
       level1_dcache_linesize
-        = handle_intel (_SC_LEVEL1_DCACHE_LINESIZE, cpu_features);
+        = handle_intel (_SC_LEVEL1_DCACHE_LINESIZE, cpu_features, &cache);
       level2_cache_size
-        = handle_intel (_SC_LEVEL2_CACHE_SIZE, cpu_features);
+        = handle_intel (_SC_LEVEL2_CACHE_SIZE, cpu_features, &cache);
       level2_cache_assoc
-        = handle_intel (_SC_LEVEL2_CACHE_ASSOC, cpu_features);
+        = handle_intel (_SC_LEVEL2_CACHE_ASSOC, cpu_features, &cache);
       level2_cache_linesize
-        = handle_intel (_SC_LEVEL2_CACHE_LINESIZE, cpu_features);
+        = handle_intel (_SC_LEVEL2_CACHE_LINESIZE, cpu_features, &cache);
       level3_cache_size = shared;
       level3_cache_assoc
-        = handle_intel (_SC_LEVEL3_CACHE_ASSOC, cpu_features);
+        = handle_intel (_SC_LEVEL3_CACHE_ASSOC, cpu_features, &cache);
       level3_cache_linesize
-        = handle_intel (_SC_LEVEL3_CACHE_LINESIZE, cpu_features);
+        = handle_intel (_SC_LEVEL3_CACHE_LINESIZE, cpu_features, &cache);
       level4_cache_size
-        = handle_intel (_SC_LEVEL4_CACHE_SIZE, cpu_features);
+        = handle_intel (_SC_LEVEL4_CACHE_SIZE, cpu_features, &cache);
 
       get_common_cache_info (&shared, &shared_per_thread, &threads,
-                             level2_cache_size);
+                             level2_cache_size, &cache);
     }
   else if (cpu_features->basic.kind == arch_kind_zhaoxin)
     {
@@ -1066,7 +1092,7 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
       level3_cache_linesize = handle_zhaoxin (_SC_LEVEL3_CACHE_LINESIZE);
 
       get_common_cache_info (&shared, &shared_per_thread, &threads,
-                             level2_cache_size);
+                             level2_cache_size, NULL);
     }
   else if (cpu_features->basic.kind == arch_kind_amd)
     {
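
The caching scheme described in the commit message can be illustrated with a small, portable sketch. It is not the patch itself: fake_cpuid_count, struct subleaf_cache, cached_subleaf, count_cache_levels and demo_lookups are hypothetical names, and the cpuid instruction is replaced by a counting stub with made-up register values so that the sketch builds on any architecture. It only shows the idea of a caller-owned, stack-local cache that stores consecutive subleaves and falls back to a direct query when no slot is free.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the cpuid instruction: it returns made-up
   register values and counts how often it is "executed". */
static unsigned int fake_cpuid_calls;

static void
fake_cpuid_count (unsigned int leaf, unsigned int subleaf,
                  unsigned int regs[4])
{
  fake_cpuid_calls++;
  /* Assumption for the sketch: subleaves 0..2 describe caches and
     subleaf 3 is the terminating entry (low five bits of EAX are 0). */
  regs[0] = subleaf < 3 ? subleaf + 1 : 0;
  regs[1] = leaf;
  regs[2] = subleaf;
  regs[3] = 0;
}

/* Caller-owned cache; `valid` counts the leading subleaves stored. */
struct subleaf_cache
{
  unsigned char valid;
  unsigned int regs[4][4];
};

#define CACHE_SLOTS(c) (sizeof ((c)->regs) / sizeof ((c)->regs[0]))

/* Fetch one subleaf of leaf 4, reusing a stored copy when present. */
static void
cached_subleaf (struct subleaf_cache *cache, unsigned int subleaf,
                unsigned int out[4])
{
  if (subleaf < cache->valid)
    {
      for (size_t i = 0; i < 4; i++)
        out[i] = cache->regs[subleaf][i];
      return;
    }

  fake_cpuid_count (4, subleaf, out);

  /* Store only the next consecutive subleaf, and only while a slot
     remains; otherwise the result is used once and not kept. */
  if (subleaf == cache->valid && subleaf < CACHE_SLOTS (cache))
    {
      for (size_t i = 0; i < 4; i++)
        cache->regs[subleaf][i] = out[i];
      cache->valid++;
    }
}

/* Walk subleaves until the terminating entry; return how many
   cache-describing subleaves were seen. */
static unsigned int
count_cache_levels (struct subleaf_cache *cache)
{
  unsigned int regs[4];
  unsigned int n = 0;
  for (;;)
    {
      cached_subleaf (cache, n, regs);
      if ((regs[0] & 0x1f) == 0)
        return n;
      n++;
    }
}

/* Repeat the walk `rounds` times with one stack-local cache and
   report how many times the stub was executed in total. */
static unsigned int
demo_lookups (unsigned int rounds)
{
  struct subleaf_cache cache = { 0 };
  fake_cpuid_calls = 0;
  for (unsigned int r = 0; r < rounds; r++)
    {
      unsigned int levels = count_cache_levels (&cache);
      assert (levels == 3);
    }
  return fake_cpuid_calls;
}
```

Under these assumptions, demo_lookups (12) executes the stub four times (three cache subleaves plus the terminating entry), the same count as demo_lookups (1). Twelve matches the number of handle_intel calls in the Intel branch of the diff above. Without the cache, each walk would repeat all four queries.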