Abstract:
An integrated circuit (IC) package includes a stacked-die memory device. The stacked-die memory device includes a set of one or more stacked memory dies implementing memory cell circuitry. The stacked-die memory device further includes a set of one or more logic dies electrically coupled to the memory cell circuitry. The set of one or more logic dies includes a query controller and a memory controller. The memory controller is coupleable to at least one device external to the stacked-die memory device. The query controller is to perform a query operation on data stored in the memory cell circuitry responsive to a query command received from the external device.
Abstract:
An integrated circuit (IC) package includes a stacked-die memory device. The stacked-die memory device includes a set of one or more stacked memory dies implementing memory cell circuitry. The stacked-die memory device further includes a set of one or more logic dies electrically coupled to the memory cell circuitry. The set of one or more logic dies includes a query controller and a memory controller. The memory controller is coupleable to at least one device external to the stacked-die memory device. The query controller is to perform a query operation on data stored in the memory cell circuitry responsive to a query command received from the external device.
Abstract:
Durations of active performance states of components of a processing system can be predicted based on one or more previous durations of an active state of the components. One or more entities in the processing system such as processor cores or caches can be configured based on the predicted durations of the active state of the components. Some embodiments configure a first component in a processing system based on a predicted duration of an active state of a second component of the processing system. The predicted duration is predicted based on one or more previous durations of an active state of the second component.
Abstract:
Some die-stacked memories will contain a logic layer in addition to one or more layers of DRAM (or other memory technology). This logic layer may be a discrete logic die or logic on a silicon interposer associated with a stack of memory dies. Additional circuitry/functionality is placed on the logic layer to implement functionality to perform various computation operations. This functionality would be desired where performing the operations locally near the memory devices would allow increased performance and/or power efficiency by avoiding transmission of data across the interface to the host processor.
Abstract:
A system includes an atomic processing engine (APE) coupled to an interconnect. The interconnect is to couple to one or more processor cores. The APE receives a plurality of commands from the one or more processor cores through the interconnect. In response to a first command, the APE performs a first plurality of operations associated with the first command. The first plurality of operations references multiple memory locations, at least one of which is shared between two or more threads executed by the one or more processor cores.
Abstract:
Embodiments include methods, systems and non-transitory computer-readable computer readable media including instructions for executing a prefetch kernel that includes memory accesses for prefetching data for a processing kernel into a memory, and, subsequent to executing at least a portion of the prefetch kernel, executing the processing kernel where the processing kernel includes accesses to data that is stored into the memory resulting from execution of the prefetch kernel.
Abstract:
Embodiments include methods, systems and non-transitory computer-readable computer readable media including instructions for executing a prefetch kernel with reduced intermediate state storage resource requirements. These include executing a prefetch kernel on a graphics processing unit (GPU), such that the prefetch kernel begins executing before a processing kernel. The prefetch kernel performs memory operations that are based upon at least a subset of memory operations in the processing kernel.
Abstract:
An integrated circuit device includes a memory controller coupleable to a memory. The memory controller to schedule memory accesses to regions of the memory based on memory timing parameters specific to the regions. A method includes receiving a memory access request at a memory device. The method further includes accessing, from a timing data store of the memory device, data representing a memory timing parameter specific to a region of the memory cell circuitry targeted by the memory access request. The method also includes scheduling, at the memory controller, the memory access request based on the data.
Abstract:
Systems, apparatuses, and methods for migrating execution contexts are disclosed. A system includes a plurality of processing units and memory devices. The system is configured to execute any number of software applications. The system is configured to detect, within a first software application, a primitive for migrating at least a portion of the execution context of a source processing unit to a target processing unit, wherein the primitive includes one or more instructions. The execution context includes a plurality of registers. A first processing unit is configured to execute the one or more instructions of the primitive to cause a portion of an execution context of the first processing unit to be migrated to a second processing unit. In one embodiment, executing the primitive instruction(s) causes an instruction pointer value, with an optional offset value, to be sent to the second processing unit.
Abstract:
Systems, apparatuses, and methods for implementing efficient queues and other data structures. A queue may be shared among multiple processors and/or threads without using explicit software atomic instructions to coordinate access to the queue. System software may allocate an atomic queue and corresponding queue metadata in system memory and return, to the requesting thread, a handle referencing the queue metadata. Any number of threads may utilize the handle for accessing the atomic queue. The logic for ensuring the atomicity of accesses to the atomic queue may reside in a management unit in the memory controller coupled to the memory where the atomic queue is allocated.