Abstract:
Systems, apparatuses, and methods for implementing mechanisms to improve data locality for distributed processing units are disclosed. A system includes a plurality of distributed processing units (e.g., GPUs) and memory devices. Each processing unit is coupled to one or more local memory devices. The system determines how to partition a workload into a plurality of workgroups based on maximizing data locality and data sharing. The system determines which subset of the plurality of workgroups to dispatch to each processing unit of the plurality of processing units based on maximizing local memory accesses and minimizing remote memory accesses. The system also determines how to partition data buffer(s) based on data sharing patterns of the workgroups. The system maps to each processing unit a separate portion of the data buffer(s) so as to maximize local memory accesses and minimize remote memory accesses.
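The dispatch decision described above can be illustrated with a minimal sketch. It assumes the data buffer is already split into tiles, each with a known home processing unit, and it places each workgroup on the unit that holds most of the tiles it touches. The names `dispatch_workgroups`, `remote_accesses`, `accesses`, and `tile_home` are hypothetical, and the sketch ignores load balancing, which a real system would also need to consider.

```python
from collections import Counter

def dispatch_workgroups(accesses, tile_home):
    """Assign each workgroup to the unit that is home to most of its tiles.

    accesses:  dict workgroup id -> list of tile ids the workgroup touches
    tile_home: dict tile id -> processing unit whose local memory holds the tile
    """
    placement = {}
    for wg, tiles in accesses.items():
        counts = Counter(tile_home[t] for t in tiles)
        # Most local tiles wins; ties go to the lowest unit id.
        placement[wg] = min(counts, key=lambda unit: (-counts[unit], unit))
    return placement

def remote_accesses(accesses, tile_home, placement):
    """Count tile accesses that land in another unit's local memory."""
    return sum(
        1
        for wg, tiles in accesses.items()
        for t in tiles
        if tile_home[t] != placement[wg]
    )

# Hypothetical workload: four workgroups sharing neighboring tiles.
accesses = {0: [0, 1], 1: [1, 2], 2: [2, 3], 3: [3, 3]}
tile_home = {0: 0, 1: 0, 2: 1, 3: 1}
placement = dispatch_workgroups(accesses, tile_home)
```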
Abstract:
Cluster manager functional blocks perform operations for migrating pages between portions in corresponding migration clusters. During operation, each cluster manager keeps an access record that indicates accesses of pages in the portions of its migration cluster. Based on the access record and one or more migration policies, each cluster manager migrates pages between the portions in its migration cluster.
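A minimal sketch of one cluster manager follows. It assumes a migration cluster with two portions, here called `fast` and `slow`, and a simple threshold policy. The class name, the two-portion layout, and the policy are all illustrative assumptions; the abstract does not specify them.

```python
class ClusterManager:
    """Tracks page accesses and migrates hot pages into the fast portion."""

    def __init__(self, fast_capacity, threshold):
        self.fast_capacity = fast_capacity  # pages the fast portion can hold
        self.threshold = threshold          # accesses needed to count as hot
        self.access_record = {}             # page -> accesses since last pass
        self.fast = set()
        self.slow = set()

    def add_page(self, page):
        self.slow.add(page)
        self.access_record[page] = 0

    def record_access(self, page):
        self.access_record[page] += 1

    def migrate(self):
        """Apply the (assumed) policy once and return the pages promoted."""
        hot = sorted(
            (p for p in self.slow if self.access_record[p] >= self.threshold),
            key=lambda p: (-self.access_record[p], p),
        )
        moved = []
        for page in hot:
            if len(self.fast) >= self.fast_capacity:
                coldest = min(self.fast, key=lambda p: (self.access_record[p], p))
                if self.access_record[coldest] >= self.access_record[page]:
                    break  # nothing in the fast portion is colder
                self.fast.remove(coldest)
                self.slow.add(coldest)
            self.slow.remove(page)
            self.fast.add(page)
            moved.append(page)
        for p in self.access_record:
            self.access_record[p] = 0  # start a fresh observation window
        return moved
```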
Abstract:
An integrated circuit (IC) package includes a stacked-die memory device. The stacked-die memory device includes a set of one or more stacked memory dies implementing memory cell circuitry. The stacked-die memory device further includes a set of one or more logic dies electrically coupled to the memory cell circuitry. The set of one or more logic dies includes a query controller and a memory controller. The memory controller is coupleable to at least one device external to the stacked-die memory device. The query controller is to perform a query operation on data stored in the memory cell circuitry responsive to a query command received from the external device.
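As a rough software analogy, the sketch below models a device with two entry points: a plain read path standing in for the memory controller and a query path standing in for the query controller. The class `StackedMemoryDevice`, the record layout, and the equality-match query are assumptions made for illustration only.

```python
class StackedMemoryDevice:
    """Toy model: records live in the device, queries run next to the data."""

    def __init__(self, records):
        self._cells = list(records)  # stands in for the memory cell circuitry

    def read(self, index):
        # Memory controller path: return one stored record to the external device.
        return self._cells[index]

    def query(self, field, value):
        # Query controller path: scan inside the device and return only the
        # indices of matching records, not the records themselves.
        return [i for i, rec in enumerate(self._cells) if rec.get(field) == value]

device = StackedMemoryDevice([
    {"id": 1, "color": "red"},
    {"id": 2, "color": "blue"},
    {"id": 3, "color": "red"},
])
matches = device.query("color", "red")
```

Only the matching indices cross the device boundary in this model, which is the point of placing the query logic on the logic dies.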
Abstract:
Methods and apparatus obtain one or more system page table entries that represent virtual system (e.g., memory) page to physical system page translations. The number of obtained system page table entries that can be encoded in each of a plurality of translation lookaside buffer (TLB) entry encoding formats is determined. The methods and apparatus may select one of the TLB entry encoding formats based on the number of obtained system page table entries it can encode. The methods and apparatus may encode a number of the obtained system page table entries in the selected TLB entry encoding format to form a compressed encoding format TLB entry. The methods and apparatus may associate the compressed encoding format TLB entry with an encoding format indication of the selected encoding format. The methods and apparatus may decode a compressed encoding format TLB entry based on a determined TLB entry encoding format.
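The selection step can be sketched as follows. The three formats (`single`, `contiguous`, `base_delta`), their entry limits, and the rule of picking whichever format covers the most entries are assumptions chosen for illustration. The sketch also assumes the page table entries are given as `(vpn, ppn)` pairs for consecutive virtual page numbers.

```python
def count_contiguous(ptes, max_entries=8):
    """Entries coverable when physical pages are contiguous like the virtual ones."""
    n = 1
    while n < len(ptes) and n < max_entries and ptes[n][1] == ptes[0][1] + n:
        n += 1
    return n

def count_base_delta(ptes, max_entries=4, delta_bits=8):
    """Entries coverable as a base page plus small non-negative deltas."""
    n = 1
    while (n < len(ptes) and n < max_entries
           and 0 <= ptes[n][1] - ptes[0][1] < (1 << delta_bits)):
        n += 1
    return n

FORMATS = {
    "single": lambda ptes: 1,
    "contiguous": count_contiguous,
    "base_delta": count_base_delta,
}

def encode(ptes):
    """Pick the format covering the most entries and build a tagged TLB entry."""
    counts = {name: fn(ptes) for name, fn in FORMATS.items()}
    fmt = max(counts, key=lambda name: counts[name])  # ties: first listed format
    n = counts[fmt]
    base_vpn, base_ppn = ptes[0]
    if fmt == "contiguous":
        payload = (base_ppn, n)
    elif fmt == "base_delta":
        payload = (base_ppn, [ppn - base_ppn for _, ppn in ptes[:n]])
    else:
        payload = (base_ppn,)
    return {"format": fmt, "base_vpn": base_vpn, "payload": payload}

def decode(entry):
    """Expand a tagged entry back into (vpn, ppn) pairs using its format field."""
    vpn, payload = entry["base_vpn"], entry["payload"]
    if entry["format"] == "contiguous":
        base_ppn, n = payload
        return [(vpn + i, base_ppn + i) for i in range(n)]
    if entry["format"] == "base_delta":
        base_ppn, deltas = payload
        return [(vpn + i, base_ppn + d) for i, d in enumerate(deltas)]
    return [(vpn, payload[0])]
```

The `format` field plays the role of the encoding format indication: the decoder consults it before interpreting the payload.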
Abstract:
A technique for source-side memory request network admission control includes injecting, by a first node, memory requests into a network coupled to a memory system at a rate of injection, and adjusting, by the first node, the rate of injection. The adjusting is based on an injection policy for the first node and memory request efficiency indicators. The injection policy may be based on an injection rate limit for the first node or on an injection rate limit per memory channel for the first node. The technique may include determining the memory request efficiency indicators based on comparisons of target addresses of the memory requests to addresses of recent memory requests of the first node.
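One way to picture the adjustment loop is the sketch below. It assumes the efficiency indicator is the fraction of requests whose target address falls in the same row as one of the node's recent requests, and that the policy halves the rate when efficiency is low and raises it by one step, up to the rate limit, when efficiency is high. The class name, thresholds, row size, and history depth are all hypothetical.

```python
from collections import deque

class InjectionController:
    """Source-side throttle driven by a row-locality efficiency indicator."""

    def __init__(self, rate_limit, row_bits=12, history=4):
        self.rate_limit = rate_limit   # injection rate limit for this node
        self.rate = rate_limit         # current injection rate
        self.row_bits = row_bits       # assumed row size: 2**row_bits bytes
        self.recent_rows = deque(maxlen=history)
        self.hits = 0
        self.total = 0

    def observe(self, address):
        """Compare a request's target address with recent request addresses."""
        row = address >> self.row_bits
        self.total += 1
        if row in self.recent_rows:
            self.hits += 1
        self.recent_rows.append(row)

    def adjust(self, low=0.25, high=0.75):
        """Update the injection rate from the efficiency seen since the last call."""
        efficiency = self.hits / self.total if self.total else 1.0
        if efficiency < low:
            self.rate = max(1, self.rate // 2)
        elif efficiency > high:
            self.rate = min(self.rate_limit, self.rate + 1)
        self.hits = self.total = 0
        return self.rate
```

A per-memory-channel variant could keep one such controller per channel, indexed by the channel bits of the address.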
Abstract:
A system includes a first memory and a device coupleable to the first memory. The device includes a second memory to cache data from the first memory. The second memory includes a plurality of rows, each row including a corresponding set of compressed data blocks of non-uniform sizes and a corresponding set of tag blocks. Each tag block represents a corresponding compressed data block of the row. The device further includes decompression logic to decompress data blocks accessed from the second memory. The device further includes compression logic to compress data blocks to be stored in the second memory.
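The row organization can be sketched in a few lines. The sketch assumes each row has a fixed byte capacity and that each tag block records the address, offset, and compressed length of one data block. It uses `zlib` from the Python standard library as a stand-in for the compression logic; the actual compression scheme is not specified in the abstract.

```python
import zlib

class CompressedRow:
    """One cache row holding variable-size compressed blocks plus their tags."""

    def __init__(self, capacity):
        self.capacity = capacity   # bytes available for compressed data
        self.data = bytearray()    # compressed blocks packed back to back
        self.tags = []             # (address, offset, length) per block

    def store(self, address, block):
        """Compress a block and append it if it fits; return whether it fit."""
        packed = zlib.compress(block)
        if len(self.data) + len(packed) > self.capacity:
            return False
        self.tags.append((address, len(self.data), len(packed)))
        self.data += packed
        return True

    def load(self, address):
        """Find the block's tag, then decompress the bytes it points to."""
        for tag_address, offset, length in self.tags:
            if tag_address == address:
                return zlib.decompress(bytes(self.data[offset:offset + length]))
        return None

row = CompressedRow(capacity=256)
zeros = bytes(64)              # highly compressible block
ramp = bytes(range(64))        # poorly compressible block
row.store(0x40, zeros)
row.store(0x80, ramp)
```

Because the two blocks compress to different sizes, the tags carry different lengths, which is what lets a row pack more blocks when the data compresses well.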
Abstract:
A system includes a plurality of memory classes and a set of one or more processing units coupled to the plurality of memory classes. The system further includes a data migration controller to select a traffic rate as a maximum traffic rate for transferring data between the plurality of memory classes based on a net benefit metric associated with the traffic rate, and to enforce the maximum traffic rate for transferring data between the plurality of memory classes.
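The selection and enforcement steps might look like the sketch below. It assumes the net benefit metric is an estimated benefit minus an estimated cost for each candidate rate, and that enforcement is a per-epoch byte budget. The function `select_max_rate`, the class `MigrationThrottle`, and the example benefit and cost curves are hypothetical.

```python
def select_max_rate(candidates, benefit, cost):
    """Return the candidate traffic rate with the highest net benefit."""
    return max(candidates, key=lambda rate: benefit(rate) - cost(rate))

class MigrationThrottle:
    """Enforces a maximum amount of migration traffic per epoch."""

    def __init__(self, max_rate):
        self.max_rate = max_rate
        self.used = 0

    def new_epoch(self):
        self.used = 0

    def try_transfer(self, amount):
        """Admit the transfer only if it keeps the epoch within the budget."""
        if self.used + amount > self.max_rate:
            return False
        self.used += amount
        return True

# Assumed curves: benefit saturates at rate 4, cost grows linearly.
benefit = lambda rate: 10 * min(rate, 4)
cost = lambda rate: 3 * rate
max_rate = select_max_rate([1, 2, 4, 8], benefit, cost)
throttle = MigrationThrottle(max_rate)
```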
Abstract:
A processing system having a multilevel cache hierarchy employs techniques for repurposing dead cache blocks so as to use otherwise wasted space in a cache hierarchy employing a write-back scheme. For a cache line containing invalid data with a valid tag, the valid tag is maintained for cache coherence purposes or otherwise, resulting in a valid tag for a dead cache block. A cache controller repurposes the dead cache block by storing any of a variety of new data at the dead cache block, while storing the new tag in a tag entry of a dead block tag way with an identifier indicating the location of the new data.
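A minimal sketch of the repurposing step for a single cache set follows. It assumes a dead block is a way whose tag is still valid but whose data is not, and it models the dead block tag way as a map from the new tag to the index of the way holding the new data. The class `CacheSet` and its method names are illustrative assumptions.

```python
class CacheSet:
    """One cache set with an extra 'dead block tag way' for repurposed blocks."""

    def __init__(self, num_ways):
        self.tags = [None] * num_ways
        self.data = [None] * num_ways
        self.data_valid = [False] * num_ways
        self.dead_tag_way = {}  # new tag -> way index holding the new data

    def fill(self, way, tag, data):
        self.tags[way], self.data[way], self.data_valid[way] = tag, data, True

    def invalidate_data(self, way):
        # The tag stays valid (for example, for coherence); only the data dies.
        self.data[way] = None
        self.data_valid[way] = False

    def repurpose(self, new_tag, new_data):
        """Store new data in a dead block; record its location in the tag way."""
        in_use = set(self.dead_tag_way.values())
        for way, tag in enumerate(self.tags):
            if tag is not None and not self.data_valid[way] and way not in in_use:
                self.data[way] = new_data
                self.dead_tag_way[new_tag] = way
                return True
        return False  # no dead block available in this set

    def lookup(self, tag):
        for way, t in enumerate(self.tags):
            if t == tag and self.data_valid[way]:
                return self.data[way]
        way = self.dead_tag_way.get(tag)
        return self.data[way] if way is not None else None
```

The original tag of the dead block is left untouched, so whatever relies on that tag continues to see it while the data storage serves the new tag.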
Abstract:
The described embodiments include a cache controller with a prediction mechanism in a cache. In the described embodiments, the prediction mechanism is configured to perform a lookup in each table in a hierarchy of lookup tables in parallel to determine if a memory request is predicted to be a hit in the cache, each table in the hierarchy comprising predictions whether memory requests to corresponding regions of a main memory will hit the cache, the corresponding regions of the main memory being smaller for tables lower in the hierarchy.
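The table hierarchy can be sketched as below. The sketch assumes three tables covering 4 MB, 64 KB, and 4 KB regions, and it resolves the lookups by preferring the prediction from the lowest (finest-grained) table that has an entry. That resolution rule, the region sizes, and the class `HitPredictor` are assumptions; the lookups run sequentially here, whereas the abstract describes them as parallel.

```python
class HitPredictor:
    """Hierarchy of tables predicting cache hits per memory region."""

    def __init__(self, region_bits=(22, 16, 12)):
        # Ordered from the top of the hierarchy (largest regions) to the bottom.
        self.region_bits = region_bits
        self.tables = [dict() for _ in region_bits]

    def train(self, level, address, will_hit):
        """Record a prediction for the region containing address at one level."""
        self.tables[level][address >> self.region_bits[level]] = will_hit

    def predict(self, address, default=False):
        # Look up every table (conceptually in parallel), then take the
        # prediction from the lowest table that holds an entry.
        results = [
            table.get(address >> bits)
            for table, bits in zip(self.tables, self.region_bits)
        ]
        for result in reversed(results):
            if result is not None:
                return result
        return default
```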