From patchwork Mon Jul 17 20:10:43 2023
X-Patchwork-Submitter: Noah Goldstein
X-Patchwork-Id: 72810
To: libc-alpha@sourceware.org
Cc: goldstein.w.n@gmail.com, hjl.tools@gmail.com, carlos@systemhalted.org
Subject: [PATCH v2] x86: Use `3/4*sizeof(per-thread-L3)` as low bound for NT threshold.
Date: Mon, 17 Jul 2023 15:10:43 -0500
Message-Id: <20230717201043.105528-1-goldstein.w.n@gmail.com>
In-Reply-To: <20230714151459.3357038-1-goldstein.w.n@gmail.com>
References: <20230714151459.3357038-1-goldstein.w.n@gmail.com>
From: Noah Goldstein

On some machines
we end up with incomplete cache information. This can make the new
calculation of `sizeof(total-L3)/custom-divisor` end up lower than
intended (and lower than the prior value). So reintroduce the old bound
as a lower bound to avoid potentially regressing code where we don't
have complete information to make the decision.
---
 sysdeps/x86/dl-cacheinfo.h | 19 ++++++++++++++-----
 1 file changed, 14 insertions(+), 5 deletions(-)

diff --git a/sysdeps/x86/dl-cacheinfo.h b/sysdeps/x86/dl-cacheinfo.h
index c98fa57a7b..cd4d0351ae 100644
--- a/sysdeps/x86/dl-cacheinfo.h
+++ b/sysdeps/x86/dl-cacheinfo.h
@@ -614,8 +614,8 @@ get_common_cache_info (long int *shared_ptr, long int * shared_per_thread_ptr, u
   /* Account for non-inclusive L2 and L3 caches.  */
   if (!inclusive_cache)
     {
-      if (threads_l2 > 0)
-	shared_per_thread += core / threads_l2;
+      long int core_per_thread = threads_l2 > 0 ? (core / threads_l2) : core;
+      shared_per_thread += core_per_thread;
       shared += core;
     }

@@ -745,8 +745,8 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)

   /* The default setting for the non_temporal threshold is [1/8, 1/2] of size
      of the chip's cache (depending on `cachesize_non_temporal_divisor` which
-     is microarch specific.  The default is 1/4).  For most Intel and AMD
-     processors with an initial release date between 2017 and 2023, a thread's
+     is microarch specific.  The default is 1/4).  For most Intel processors
+     with an initial release date between 2017 and 2023, a thread's
      typical share of the cache is from 18-64MB.  Using a reasonable size
      fraction of L3 is meant to estimate the point where non-temporal stores
      begin out-competing REP MOVSB.  As well the point where the fact that
@@ -757,12 +757,21 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
      the maximum thrashing capped at 1/associativity.  */
   unsigned long int non_temporal_threshold
       = shared / cachesize_non_temporal_divisor;
+
+  /* If the computed non_temporal_threshold <= 3/4 * per-thread L3, we most
+     likely have incorrect/incomplete cache info in which case, default to
+     3/4 * per-thread L3 to avoid regressions.  */
+  unsigned long int non_temporal_threshold_lowbound
+      = shared_per_thread * 3 / 4;
+  if (non_temporal_threshold < non_temporal_threshold_lowbound)
+    non_temporal_threshold = non_temporal_threshold_lowbound;
+
   /* If no ERMS, we use the per-thread L3 chunking.  Normal cacheable stores
      run a higher risk of actually thrashing the cache as they don't have a HW
      LRU hint.  As well, their performance in highly parallel situations is
      noticeably worse.  */
   if (!CPU_FEATURE_USABLE_P (cpu_features, ERMS))
-    non_temporal_threshold = shared_per_thread * 3 / 4;
+    non_temporal_threshold = non_temporal_threshold_lowbound;
   /* SIZE_MAX >> 4 because memmove-vec-unaligned-erms right-shifts the value
      of 'x86_non_temporal_threshold' by `LOG_4X_MEMCPY_THRESH` (4) and it is
      best if that operation cannot overflow.  Minimum of 0x4040 (16448) because the