Abstract:
A cooling system is provided for a 3D integrated circuit (IC) to deliver fluid in the x, y, and z dimensions to interior regions of the IC to regulate heat. The IC includes a microfluidic network of channels, at least one sensor, and at least one microelectromechanical system (MEMS)-based device that is disposed within the network of channels and configured to regulate a flow of fluid within the network of channels. Each sensor monitors a state of the IC. Each MEMS-based device receives control signals that reflect the state of the IC and regulates the flow of fluid within the network of channels in real time as changes in the state of the IC are detected.
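
A minimal sketch of the kind of closed-loop control the abstract describes, assuming a simple proportional controller and hypothetical Sensor and MemsValve structures (none of these names come from the patent):

# Illustrative sketch (not from the patent): a control loop that reads
# temperature sensors in a 3D IC and adjusts hypothetical MEMS microvalves
# to steer coolant through the channel network in real time.

from dataclasses import dataclass

@dataclass
class Sensor:
    region: str           # interior region of the IC being monitored
    temperature_c: float  # last sampled temperature

@dataclass
class MemsValve:
    region: str
    opening: float = 0.0  # 0.0 = closed, 1.0 = fully open

def control_step(sensors, valves, setpoint_c=85.0, gain=0.05):
    """One real-time iteration: open valves near hot regions, close them
    near cool ones (a simple proportional controller for illustration)."""
    temps = {s.region: s.temperature_c for s in sensors}
    for v in valves:
        error = temps[v.region] - setpoint_c
        v.opening = min(1.0, max(0.0, v.opening + gain * error))

sensors = [Sensor("core0", 92.0), Sensor("cache", 78.0)]
valves = [MemsValve("core0"), MemsValve("cache")]
control_step(sensors, valves)
print({v.region: round(v.opening, 3) for v in valves})  # the core0 valve opens
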
Abstract:
A processor employs a hierarchical register file for a graphics processing unit (GPU). A top level of the hierarchical register file is stored at a local memory of the GPU (e.g., a memory on the same integrated circuit die as the GPU). Lower levels of the hierarchical register file are stored at a different, larger memory, such as a remote memory located on a different die than the GPU. A register file control module monitors the status of in-flight wavefronts at the GPU, and in particular whether each in-flight wavefront is active, predicted to become active, or inactive. The register file control module places execution data for active and predicted-active wavefronts in the top level of the hierarchical register file and places execution data for inactive wavefronts at lower levels of the hierarchical register file.
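
A minimal sketch of the placement policy the abstract describes, assuming a two-level store and a hypothetical HierarchicalRegisterFile class (the wavefront states mirror the three statuses named above):

# Illustrative sketch (not from the patent): a register file control module
# that keeps execution data for active and predicted-active wavefronts in a
# small local (top-level) store and spills inactive wavefronts to a larger
# remote (lower-level) store. All names are hypothetical.

from enum import Enum

class WavefrontState(Enum):
    ACTIVE = 1
    PREDICTED_ACTIVE = 2
    INACTIVE = 3

class HierarchicalRegisterFile:
    def __init__(self):
        self.local = {}   # top level: on-die memory (fast, small)
        self.remote = {}  # lower level: off-die memory (slow, large)

    def update(self, wavefront_id, state, exec_data):
        """Place execution data according to the wavefront's status."""
        if state in (WavefrontState.ACTIVE, WavefrontState.PREDICTED_ACTIVE):
            self.remote.pop(wavefront_id, None)
            self.local[wavefront_id] = exec_data
        else:
            self.local.pop(wavefront_id, None)
            self.remote[wavefront_id] = exec_data

rf = HierarchicalRegisterFile()
rf.update(0, WavefrontState.ACTIVE, b"vgpr-data-0")
rf.update(1, WavefrontState.INACTIVE, b"vgpr-data-1")
print(sorted(rf.local), sorted(rf.remote))  # [0] [1]
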
Abstract:
A processing system includes a processing module having a first interface coupleable to an interconnect. The first interface includes a first cryptologic engine to encrypt a representation of store data of a store operation and a memory address using a first key and a first feedback-based cryptologic process to generate first encrypted data and an encrypted memory address. A memory module includes a memory core and a second interface coupled to the interconnect. The second interface includes a second cryptologic engine to decrypt the first encrypted data and the encrypted memory address using a second key and a second feedback-based cryptologic process to generate a copy of the representation of the store data and a copy of the memory address. The second interface further stores the copy of the representation of the store data to a memory location of the memory core based on the copy of the memory address.
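
A minimal sketch of the symmetry between the two interfaces, assuming a toy byte-level XOR feedback transform; the actual feedback-based cryptologic process is not specified in the abstract, and this stand-in is illustrative only, not secure:

# Illustrative sketch (not from the patent): a toy feedback-based transform
# applied to the store data and memory address at each interface, showing
# how the memory-side interface recovers what the processor-side sent.

def feedback_transform(data: bytes, key: bytes, decrypt=False) -> bytes:
    out = bytearray()
    fb = key[0]  # feedback register seeded from the key
    for i, b in enumerate(data):
        c = b ^ key[i % len(key)] ^ fb
        out.append(c)
        fb = b if decrypt else c  # feed the ciphertext byte back in
    return bytes(out)

key = b"\x13\x37\x42\x99"
store_data, address = b"\xde\xad\xbe\xef", (0x1000).to_bytes(8, "little")

# Processor-side interface: encrypt data and address before the interconnect.
enc_data = feedback_transform(store_data, key)
enc_addr = feedback_transform(address, key)

# Memory-side interface: decrypt with the matching key, then store the
# recovered data at the recovered address.
assert feedback_transform(enc_data, key, decrypt=True) == store_data
assert feedback_transform(enc_addr, key, decrypt=True) == address
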
Abstract:
A memory-to-memory copy operation control system includes a processor configured to receive an instruction to perform a memory-to-memory copy operation, and a memory module network in communication with the processor. The memory module network has a plurality of memory modules, including a proximal memory module in direct communication with the processor and one or more additional memory modules in communication with the processor via the proximal memory module. The system also includes a memory controller in communication with the processor and the memory module network. The processor is configured to issue a first command causing data to be copied from a first memory module to a second memory module without sending the data to the processor or the memory controller.
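
A minimal sketch of the command flow, assuming a linear chain of hypothetical MemoryModule objects; the point is that the copied bytes move module-to-module and never return to the host:

# Illustrative sketch (not from the patent): the processor issues a single
# copy command to the proximal memory module, which forwards the command
# along the module network; the payload never crosses back to the
# processor or the memory controller.

class MemoryModule:
    def __init__(self, name):
        self.name = name
        self.storage = {}
        self.next_module = None  # link toward more-distal modules

    def handle_copy(self, src, dst, src_addr, dst_addr, length):
        """Execute the copy at the source module or forward the command."""
        if self.name == src:
            data = bytes(self.storage[src_addr + i] for i in range(length))
            self._deliver(dst, dst_addr, data)
        elif self.next_module:
            self.next_module.handle_copy(src, dst, src_addr, dst_addr, length)

    def _deliver(self, dst, dst_addr, data):
        module = self
        while module.name != dst:      # walk the chain to the destination
            module = module.next_module
        for i, b in enumerate(data):
            module.storage[dst_addr + i] = b

proximal, distal = MemoryModule("m0"), MemoryModule("m1")
proximal.next_module = distal
proximal.storage = {0: 0xAB, 1: 0xCD}
# The processor issues one command; data flows m0 -> m1 directly.
proximal.handle_copy("m0", "m1", src_addr=0, dst_addr=16, length=2)
print(distal.storage)  # {16: 171, 17: 205}
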
Abstract:
A data processing device is provided that employs multiple translation look-aside buffers (TLBs) associated with respective processors, each TLB configured to store selected address translations of a page table of a memory shared by the processors. The processing device is configured such that when an address translation is requested by a processor and is not found in the TLB associated with that processor, another TLB is probed for the requested address translation. The probe of the other TLB may occur in advance of a walk of the page table for the requested address; alternatively, a walk can be initiated concurrently with the probe. Where the probe successfully finds the requested address translation, the page table walk can be avoided or discontinued.
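
A minimal sketch of the translation path, assuming a probe-then-walk ordering for simplicity (a real design could start the walk concurrently and discontinue it on a probe hit); all names are hypothetical:

# Illustrative sketch (not from the patent): on a local TLB miss, probe
# the peer processors' TLBs before (or while) walking the page table.

def translate(vaddr, local_tlb, peer_tlbs, page_table, page_size=4096):
    vpn = vaddr // page_size
    offset = vaddr % page_size
    if vpn in local_tlb:                       # local TLB hit
        return local_tlb[vpn] * page_size + offset
    # A real design could start the page-table walk here, concurrently
    # with the probe; this sketch probes first and walks only on a miss.
    for tlb in peer_tlbs:                      # probe the other TLB(s)
        if vpn in tlb:
            local_tlb[vpn] = tlb[vpn]          # fill local, avoid the walk
            return tlb[vpn] * page_size + offset
    pfn = page_table[vpn]                      # probe missed: walk the table
    local_tlb[vpn] = pfn
    return pfn * page_size + offset

page_table = {0x10: 0x80, 0x11: 0x81}
local, peers = {}, [{0x10: 0x80}]
print(hex(translate(0x10123, local, peers, page_table)))  # 0x80123 via peer TLB
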
Abstract:
An Address Mapping-Aware Tasking (AMAT) mechanism manages compute task data and issues compute tasks on behalf of threads that created the compute task data. The AMAT mechanism stores compute task data generated by host threads in a set of partitions, where each partition is designated for a particular memory module. The AMAT mechanism maintains address mapping data that maps address information to partitions. Threads push compute task data to the AMAT mechanism instead of generating and issuing their own compute tasks. The AMAT mechanism uses address information included in the compute task data and the address mapping data to determine partitions in which to store the compute task data. The AMAT mechanism then issues compute tasks to be executed near the corresponding memory modules (i.e., in PIM execution units or NUMA compute nodes) based upon the compute task data stored in the partitions.
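
A minimal sketch of the partitioning logic, assuming a simple interleaved address-to-module mapping and a hypothetical AMAT class and task format:

# Illustrative sketch (not from the patent): an AMAT-style structure that
# maps each task's target address to a per-memory-module partition and
# later issues the queued tasks near their modules.

class AMAT:
    def __init__(self, num_modules, interleave=4096):
        self.interleave = interleave  # address-to-module granularity
        self.partitions = [[] for _ in range(num_modules)]

    def module_for(self, addr):
        """Address mapping data: which memory module owns this address."""
        return (addr // self.interleave) % len(self.partitions)

    def push(self, addr, task_data):
        """Host threads push task data instead of issuing tasks themselves."""
        self.partitions[self.module_for(addr)].append((addr, task_data))

    def issue_all(self, issue_fn):
        """Issue each partition's tasks near its memory module
        (e.g., to a PIM execution unit or NUMA compute node)."""
        for module_id, tasks in enumerate(self.partitions):
            for addr, task_data in tasks:
                issue_fn(module_id, addr, task_data)
            tasks.clear()

amat = AMAT(num_modules=2)
amat.push(0x0000, "inc")  # lands in module 0's partition
amat.push(0x1000, "inc")  # lands in module 1's partition
amat.issue_all(lambda m, a, t: print(f"module {m}: {t} @ {a:#x}"))
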
Abstract:
A processing system includes a scheduling mechanism for fine-grained reordering of the workgroups of a kernel to produce blocks of data, such as for communication across devices, enabling a producer computation to overlap with an all-reduce communication across the network. This scheduling mechanism enables a first parallel processor to schedule and execute a set of workgroups of a producer operation to generate data for transmission to a second parallel processor in a desired traffic pattern. At the same time, the second parallel processor schedules and executes a different set of workgroups of the producer operation to generate data for transmission in a desired traffic pattern to a third parallel processor or back to the first parallel processor.
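
A minimal sketch of one possible schedule, assuming a ring traffic pattern in which each device sends first to its next neighbor; the block and workgroup counts are hypothetical:

# Illustrative sketch (not from the patent): each parallel processor runs
# the producer kernel's workgroups in a rotated order, so the block its
# ring neighbor needs first is computed first and can be sent while the
# remaining workgroups still execute.

def workgroup_order(device_id, num_devices, groups_per_block):
    """Rotate block order per device: device d produces block (d+1) % n
    first, matching a ring all-reduce traffic pattern."""
    order = []
    for step in range(num_devices):
        block = (device_id + 1 + step) % num_devices
        start = block * groups_per_block
        order.extend(range(start, start + groups_per_block))
    return order

# Two workgroups per block, three devices in a ring.
for dev in range(3):
    print(f"device {dev}: {workgroup_order(dev, 3, 2)}")
# device 0 produces block 1 first (destined for device 1), device 1
# produces block 2 first, and so on, overlapping compute with transfer.
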
Abstract:
Memory allocation for processing-in-memory operations, including: receiving, by an allocation module, a memory allocation request indicating a plurality of data structure operands for a processing-in-memory operation; determining a memory allocation pattern for the plurality of data structure operands, wherein the memory allocation pattern interleaves a plurality of component pages of a memory page across the plurality of data structure operands; and allocating the memory page based on the determined memory allocation pattern.
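
A minimal sketch of the interleaved allocation pattern, assuming hypothetical page and component-page sizes and a round-robin assignment of component pages to operands:

# Illustrative sketch (not from the patent): given an allocation request
# naming several operand data structures for a processing-in-memory
# operation, interleave a page's component pages across the operands so
# that corresponding elements land in the same memory module.

def allocation_pattern(operands, page_size=4096, component_size=1024):
    """Return (offset, operand) pairs: the component pages of one memory
    page assigned round-robin across the operand data structures."""
    components = page_size // component_size
    return [(i * component_size, operands[i % len(operands)])
            for i in range(components)]

# Request: allocate one page for operands a, b of `c = a + b` done in PIM.
for offset, operand in allocation_pattern(["a", "b"]):
    print(f"component page @ {offset:#06x} -> {operand}")
# a and b alternate within the page, so a[i] and b[i] share a module.
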
Abstract:
A memory module includes one or more programmable ECC engines that may be programmed by a host processing element with a particular ECC implementation. As used herein, the term “ECC implementation” refers to ECC functionality for performing error detection and subsequent processing, for example using the results of the error detection to perform error correction and to encode corrupted data that cannot be corrected. The approach allows an SoC designer or company to program and reprogram the ECC engines in memory modules in a secure manner, without having to disclose the particular ECC implementations used by the ECC engines to memory vendors or third parties.
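
A minimal sketch of the programming interface, assuming a trivial even-parity scheme as the host-supplied "ECC implementation" (detect-only, standing in for whatever scheme the SoC designer keeps secret):

# Illustrative sketch (not from the patent): a memory-module ECC engine
# whose encode/decode behavior is installed, and can be replaced, by the
# host processing element.

class ProgrammableEccEngine:
    def __init__(self):
        self.encode = None
        self.decode = None

    def program(self, encode_fn, decode_fn):
        """Host installs (and may later replace) the ECC implementation."""
        self.encode, self.decode = encode_fn, decode_fn

# A host-supplied scheme: even parity over the data bits (detect-only).
def parity_encode(data: int) -> tuple:
    return data, bin(data).count("1") % 2

def parity_decode(data: int, parity: int):
    ok = bin(data).count("1") % 2 == parity
    return data, ok  # a real scheme would also correct or mark the data

engine = ProgrammableEccEngine()
engine.program(parity_encode, parity_decode)
codeword = engine.encode(0b1011)
print(engine.decode(*codeword))             # (11, True): no error
print(engine.decode(0b1010, codeword[1]))   # (10, False): error detected
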
Abstract:
A data processing platform, method, and program product perform compression and decompression of a set of data items. Suffix data and a prefix are selected for each respective data item in the set of data items based on data content of the respective data item. The set of data items is sorted based on the prefixes. The prefixes are encoded by querying multiple encoding tables to create a code word containing compressed information representing values of all prefixes for the set of data items. The code word and suffix data for each of the data items are stored in memory. The code word is decompressed to recover the prefixes. The recovered prefixes are paired with their respective suffix data.
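
A minimal sketch of the prefix/suffix pipeline, assuming a fixed 8/24-bit split and plain bit-packing in place of the patent's table-driven prefix encoding, which the abstract does not specify:

# Illustrative sketch (not from the patent): split each 32-bit item into
# a prefix (high bits) and suffix (low bits), sort by prefix, and pack
# all prefixes into one code word that decompression later unpacks.

PREFIX_BITS, SUFFIX_BITS = 8, 24

def compress(items):
    pairs = sorted((x >> SUFFIX_BITS, x & ((1 << SUFFIX_BITS) - 1))
                   for x in items)
    code_word = 0
    for prefix, _ in pairs:                  # pack the sorted prefixes
        code_word = (code_word << PREFIX_BITS) | prefix
    suffixes = [s for _, s in pairs]
    return code_word, suffixes

def decompress(code_word, suffixes):
    prefixes = []
    for _ in suffixes:                       # unpack prefixes (reversed)
        prefixes.append(code_word & ((1 << PREFIX_BITS) - 1))
        code_word >>= PREFIX_BITS
    prefixes.reverse()
    return [(p << SUFFIX_BITS) | s for p, s in zip(prefixes, suffixes)]

items = [0x11ABCDEF, 0x05123456, 0x11000001]
code_word, suffixes = compress(items)
assert decompress(code_word, suffixes) == sorted(items)
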