Abstract:
A processing system includes a plurality of processor cores formed in a first layer of an integrated circuit device and a plurality of partitions of memory formed in one or more second layers of the integrated circuit device. The one or more second layers are deployed in a stacked configuration with the first layer. Each of the partitions is associated with a subset of the processor cores whose footprints overlap that partition. The processing system also includes first memory paths between the processor cores and their corresponding partitions. The processing system further includes second memory paths between the processor cores and the partitions.
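A minimal Python sketch of the two-path arrangement described above; the partition-to-core mapping, path latencies, and function names are all assumed for illustration, not taken from the patent:

    # Hypothetical model of the two-path stacked-memory arrangement.
    # Each core has a "footprint" partition reachable over a short first
    # path; other partitions are reached over a longer second path.
    FIRST_PATH_LATENCY = 1   # assumed cost of a vertical (stacked) hop
    SECOND_PATH_LATENCY = 4  # assumed cost of routing across the layer

    def access_latency(core_id: int, partition_id: int,
                       cores_per_partition: int = 2) -> int:
        """Return the modeled latency for a core -> partition access."""
        # Cores whose footprint overlaps a partition use the first path.
        footprint_partition = core_id // cores_per_partition
        if partition_id == footprint_partition:
            return FIRST_PATH_LATENCY
        return SECOND_PATH_LATENCY

    if __name__ == "__main__":
        for core in range(4):
            for part in range(2):
                print(f"core {core} -> partition {part}: "
                      f"{access_latency(core, part)} cycles")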
Abstract:
A processor includes multiple processing units (e.g., processor cores), with each processing unit associated with at least one private, dedicated cache. The processor is also associated with a system memory that stores all data that can be accessed by the multiple processing units. A coherency manager (e.g., a coherence directory) of the processor enforces a specified coherency scheme to ensure data coherency between the different caches and between the caches and the system memory. In response to a memory access request to a given cache resulting in a cache miss, the coherency manager identifies the current access latency to the system memory as well as the current access latencies to the other caches of the processor. The coherency manager transfers the targeted data to the given cache from whichever of the other caches or the system memory has the lowest access latency.
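The source-selection step can be pictured with a short Python sketch; the cache names, latency values, and pick_source helper are illustrative assumptions, not the patented implementation:

    # Hypothetical sketch: pick the lowest-latency source for a missed line.
    MEMORY_LATENCY = 200  # assumed current latency to system memory

    def pick_source(line_addr: int, peer_caches: dict) -> str:
        """peer_caches maps cache name -> (set of resident lines, latency)."""
        best_name, best_latency = "system_memory", MEMORY_LATENCY
        for name, (lines, latency) in peer_caches.items():
            # Only caches that actually hold the line are candidates.
            if line_addr in lines and latency < best_latency:
                best_name, best_latency = name, latency
        return best_name

    peers = {
        "L2_core1": ({0x40, 0x80}, 40),
        "L3_shared": ({0x80}, 90),
    }
    print(pick_source(0x80, peers))  # -> "L2_core1" (latency 40 < 90 < 200)
    print(pick_source(0xC0, peers))  # -> "system_memory" (no cache holds it)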
Abstract:
A processing system selectively allocates space to store a group of one or more cache lines at a cache level of a cache hierarchy having a plurality of cache levels based on memory access patterns of a software application executing at the processing system. The processing system generates bit vectors indicating which cache levels are to allocate space to store groups of one or more cache lines based on the memory access patterns, which are derived from data granularity and movement information. Based on the bit vectors, the processing system provides hints to the cache hierarchy indicating the lowest cache level that can exploit the reuse potential of particular data.
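A minimal Python sketch of the bit-vector encoding, assuming bit i of the vector corresponds to cache level i and the hint is the lowest set bit; all names and the level numbering are illustrative:

    # Hypothetical sketch: encode per-group allocation decisions as a bit
    # vector (bit i set => level i should allocate) and derive the hint as
    # the lowest cache level with reuse potential.
    def make_bit_vector(allocate_at_levels) -> int:
        vec = 0
        for level in allocate_at_levels:  # e.g. {1, 2} for L2 and L3
            vec |= 1 << level
        return vec

    def lowest_level_hint(vec: int) -> int:
        level = 0
        while vec and not (vec >> level) & 1:
            level += 1
        return level

    vec = make_bit_vector({1, 2})   # reuse fits in L2, also cached in L3
    print(bin(vec))                 # 0b110
    print(lowest_level_hint(vec))   # 1 -> hint: allocate starting at L2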
Abstract:
A processor partitions a coherency directory into different regions for different processor cores and manages the number of entries allocated to each region based at least in part on monitored recall costs indicating expected resource costs for reallocating entries. Examples of monitored recall costs include a number of cache evictions associated with entry reallocation, the hit rate of each region of the coherency directory, and the like, or a combination thereof. By managing the entries allocated to each region based on the monitored recall costs, the processor ensures that processor cores associated with denser memory access patterns (that is, memory access patterns that more frequently access cache lines associated with the same memory pages) are assigned more entries of the coherency directory.
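A minimal Python sketch of rebalancing directory entries in proportion to monitored recall costs; the cost metric (eviction counts) and the rebalance helper are illustrative assumptions:

    # Hypothetical sketch: periodically rebalance coherency-directory
    # entries so regions with higher recall costs (e.g. more evictions
    # triggered per recalled entry) receive more entries.
    def rebalance(total_entries: int, recall_costs: dict) -> dict:
        total_cost = sum(recall_costs.values()) or 1
        return {core: max(1, total_entries * cost // total_cost)
                for core, cost in recall_costs.items()}

    costs = {"core0": 120, "core1": 30, "core2": 50}  # monitored evictions
    print(rebalance(1024, costs))
    # core0's denser access pattern (higher recall cost) earns more entries.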
Abstract:
A device receives an indication that a memory bank is to be powered down and determines, based on receiving the indication, shutdown scores corresponding to powered-up memory banks. Each shutdown score is based on a shutdown metric associated with powering down a powered-up memory bank. The device may power down a selected memory bank based on the shutdown scores.
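A minimal Python sketch of the scoring step, assuming a shutdown metric built from open rows and recent accesses; the metric, its weights, and the bank names are illustrative:

    # Hypothetical sketch: score each powered-up bank by a shutdown metric
    # (fewer open rows / less recent traffic => cheaper to power down) and
    # power down the bank with the lowest score.
    def pick_bank_to_power_down(banks: dict) -> str:
        # banks maps bank name -> (open_rows, recent_accesses)
        def score(stats):
            open_rows, recent = stats
            return 2 * open_rows + recent
        return min(banks, key=lambda b: score(banks[b]))

    banks = {"bank0": (3, 40), "bank1": (0, 5), "bank2": (1, 25)}
    print(pick_bank_to_power_down(banks))  # -> "bank1"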
Abstract:
A processor distributes memory timing parameters and data among different memory modules based upon memory access patterns. The memory access patterns indicate different types, or classes, of data for an executing workload, with each class associated with different memory access characteristics, such as different row buffer hit rate levels, different frequencies of access, different criticalities, and the like. The processor assigns each memory module to a data class and sets the memory timing parameters for each memory module according to the module's assigned data class, thereby tailoring the memory timing parameters for efficient access of the corresponding data.
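A minimal Python sketch of mapping modules to data classes and selecting timing parameters per class; the class names and timing values are illustrative assumptions, not values from the patent:

    # Hypothetical sketch: assign each memory module a data class observed
    # from access patterns and pick timing parameters suited to that class.
    TIMING_BY_CLASS = {
        # class name -> illustrative (tRCD, tRP) settings, values assumed
        "hot_row_buffer_friendly": {"tRCD": 12, "tRP": 12},
        "cold_random":             {"tRCD": 16, "tRP": 16},
    }

    def configure_modules(module_classes: dict) -> dict:
        return {module: TIMING_BY_CLASS[cls]
                for module, cls in module_classes.items()}

    assignment = {"DIMM0": "hot_row_buffer_friendly", "DIMM1": "cold_random"}
    print(configure_modules(assignment))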
Abstract:
A cache of a processor includes a cache controller to implement a cache management policy for the insertion and replacement of cache lines of the cache. The cache management policy assigns a replacement priority level to each cache line of at least a subset of cache lines in a region of the cache. The priority level is based on comparing the number of accesses made to the cache set storing a cache line since that cache line was last accessed against a reuse distance determined for the region of the cache, wherein the reuse distance represents an average number of accesses to a given cache set of the region between accesses to any given cache line of the cache set.
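A minimal Python sketch of the comparison, assuming a two-level priority scheme; the threshold logic and names are illustrative:

    # Hypothetical sketch: a line whose set has seen more accesses since the
    # line was last touched than the region's reuse distance is a likely
    # eviction candidate (low priority); otherwise it is retained (high).
    def replacement_priority(set_accesses_since_touch: int,
                             region_reuse_distance: float) -> str:
        if set_accesses_since_touch > region_reuse_distance:
            return "low"   # past its expected reuse window -> evict first
        return "high"      # likely to be reused soon -> retain

    print(replacement_priority(12, 8.5))  # -> "low"
    print(replacement_priority(3, 8.5))   # -> "high"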
Abstract:
A first processor is configured to detect migration of a page from a second memory associated with a second processor to a first memory associated with the first processor or to detect duplication of the page in the first memory and the second memory. The first processor implements a translation lookaside buffer (TLB) and the first processor is configured to insert an entry in the TLB in response to the duplication or the migration of the page. The entry maps a virtual address of the page to a physical address in the first memory and the entry is inserted into the TLB without modifying a corresponding entry in a page table that maps the virtual address of the page to a physical address in the second memory. In some cases, a duplicate translation table (DTT) stores a copy of the entry that is accessed in response to a TLB miss.
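A minimal Python sketch of a TLB-only insertion backed by a duplicate translation table, with dictionaries standing in for the hardware structures; all names and addresses are illustrative:

    # Hypothetical sketch: on page migration/duplication, insert a TLB entry
    # mapping the virtual page to its new local physical frame while the
    # page table still points at the remote frame; a duplicate translation
    # table (DTT) backs the TLB on a miss.
    page_table = {0x1000: "remote_pa_0x9000"}  # unchanged by migration
    tlb, dtt = {}, {}

    def on_migration(vaddr: int, local_pa: str) -> None:
        tlb[vaddr] = local_pa   # TLB updated...
        dtt[vaddr] = local_pa   # ...and a copy kept in the DTT
        # page_table is deliberately left untouched.

    def translate(vaddr: int) -> str:
        if vaddr in tlb:
            return tlb[vaddr]
        return dtt[vaddr] if vaddr in dtt else page_table[vaddr]

    on_migration(0x1000, "local_pa_0x2000")
    tlb.clear()                  # simulate a TLB eviction
    print(translate(0x1000))     # -> "local_pa_0x2000" via the DTT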
Abstract:
A processor sets memory timing parameters based on a profile of a workload to be executed at the processor and based on a thermal budget associated with the processor. For a given workload and amount of available thermal headroom, as indicated by a detected temperature, the processor adjusts one or more of the memory timing parameters according to the workload profile. The processor is thereby able to tailor the memory timing parameters according to the memory access behavior of the workload, improving overall processing efficiency.
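A minimal Python sketch of one such adjustment, assuming a refresh-interval parameter that is tightened when thermal headroom is low; the thresholds, profiles, and values are illustrative assumptions:

    # Hypothetical sketch: relax or tighten a memory timing parameter based
    # on the workload profile and the thermal headroom implied by a detected
    # temperature.
    def refresh_interval_us(profile: str, temp_c: float,
                            thermal_limit_c: float = 95.0) -> float:
        base = 7.8 if profile == "bandwidth_bound" else 3.9
        headroom = thermal_limit_c - temp_c
        # Little headroom -> refresh more often (smaller interval);
        # ample headroom -> keep the profile-tuned interval.
        return base / 2 if headroom < 10.0 else base

    print(refresh_interval_us("bandwidth_bound", 70.0))  # -> 7.8
    print(refresh_interval_us("bandwidth_bound", 90.0))  # -> 3.9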