Abstract:
A distributed shared-memory system includes several nodes that each have one or more processor cores, caches, local main memory, and a directory. Each node further includes predictors that use historical memory access information to predict future coherence permission requirements and speculatively initiate coherence operations. In one embodiment, predictors are included at processor cores for monitoring a memory access stream (e.g., historical sequence of memory addresses referenced by a processor core) and predicting addresses of future accesses. In another embodiment, predictors are included at the directory of each node for monitoring memory access traffic and coherence-related activities for individual cache lines to predict future demands for particular cache lines. In other embodiments, predictors are included at both the processor cores and directory of each node. Predictions from the predictors are used to initiate coherence operations to speculatively request promotion or demotion of coherence permissions.
Abstract:
A data processing system includes a plurality of processors, local memories associated with a corresponding processor, and at least one inter-processor link In response to a first processor performing a load or store operation on an address of a corresponding local memory that is not currently in the local cache, a local cache allocates a first cache line and encodes a local state with the first cache line. In response to a load operation from an address of a remote memory that is not currently in the local cache, the local cache allocates a second cache line and encodes a remote state with the second cache line. The first processor performs subsequent loads and stores on the first cache line in the local cache in response to the local state, and subsequent loads from the second cache line in the local cache in response to the remote state.
Abstract:
A cache coherence bridge protocol provides an interface between a cache coherence protocol of a host processor and a cache coherence protocol of a processor-in-memory, thereby decoupling coherence mechanisms of the host processor and the processor-in-memory. The cache coherence bridge protocol requires limited change to existing host processor cache coherence protocols. The cache coherence bridge protocol may be used to facilitate interoperability between host processors and processor-in-memory devices designed by different vendors and both the host processors and processor-in-memory devices may implement coherence techniques among computing units within each processor. The cache coherence bridge protocol may support different granularity of cache coherence permissions than those used by cache coherence protocols of a host processor and/or a processor-in-memory. The cache coherence bridge protocol uses a shadow directory that maintains status information indicating an aggregate view of copies of data cached in a system external to a processor-in-memory containing that data.
Abstract:
A communication device includes a data source that generates data for transmission over a bus, and a data encoder that receives and encodes outgoing data. An encoder system receives outgoing data from a data source and stores the outgoing data in a first queue. An encoder encodes outgoing data with a header type that is based upon a header type indication from a controller and stores the encoded data that may be a packet or a data word with at least one layered header in a second queue for transmission. The device is configured to receive at a payload extractor, a packet protocol change command from the controller and to remove the encoded data and to re-encode the data to create a re-encoded data packet and placing the re-encoded data packet in the second queue for transmission.
Abstract:
A method of managing thermal levels in a memory system may include determining an expected thermal level associated with each of a plurality of locations in a memory structure, and for each operation of a plurality of operations addressed to the memory structure, assigning the operation to a target location of the plurality of physical locations in the memory structure based on a thermal penalty associated with the operation and the expected thermal level associated with the target location.
Abstract:
A data processing device is provided that employs multiple translation look-aside buffers (TLBs) associated with respective processors that are configured to store selected address translations of a page table of a memory shared by the processors. The processing device is configured such that when an address translation is requested by a processor and is not found in the TLB associated with that processor, another TLB is probed for the requested address translation. The probe across to the other TLB may occur in advance of a walk of the page table for the requested address or alternatively a walk can be initiated concurrently with the probe. Where the probe successfully finds the requested address translation, the page table walk can be avoided or discontinued.
Abstract:
A method and apparatus for inter-row data transfer in memory devices is described. Data transfer from one physical location in a memory device to another is achieved without engaging the external input/output pins on the memory device. In an example method, a memory device is responsive to a row transfer (RT) command which includes a source row identifier and a target row identifier. The memory device activates a source row and storing source row data in a row buffer, latches the target row identifier into the memory device, activates a word line of a target row to prepare for a write operation, and stores the source row data from the row buffer into the target row.
Abstract:
Near-memory compute elements perform memory operations and temporarily store at least a portion of address information for the memory operations in local storage. A broadcast memory command is then issued to the near-memory compute elements that causes the near-memory compute elements to perform a subsequent memory operation using their respective address information stored in the local storage. This allows a single broadcast memory command to be used to perform memory operations across multiple memory elements, such as DRAM banks, using bank-specific address information. In one implementation, the approach is used to process workloads with irregular updates to memory while consuming less command bus bandwidth than conventional approaches. Implementations include using conditional flags to selectively designate address information in local storage that is to be processed with the broadcast memory command.
Abstract:
An apparatus that manages multi-process execution in a processing-in-memory (“PIM”) device includes a gatekeeper configured to: receive an identification of one or more registered PIM processes; receive, from a process, a memory request that includes a PIM command; if the requesting process is a registered PIM process and another registered PIM process is active on the PIM device, perform a context switch of PIM state between the registered PIM processes; and issue the PIM command of the requesting process to the PIM device.
Abstract:
A method and apparatus perform a first hash operation on a first key wherein the first hash operation is biased to map the first key and associated value to a set of frequently-accessed buckets in a hash table. An entry for the first key and associated value is stored in the set of frequently-accessed buckets. A second hash operation is performed on a second key wherein the second hash operation is biased to map the second key and associated value to a set of less frequently-accessed buckets in the hash table. An entry for the second key and associated value is stored in the set of less frequently-accessed buckets. The method and apparatus perform a hash table look up of the requested key in the set of frequently-accessed buckets, if the requested key is not found, then a hash table lookup is performed in the set of less frequently-accessed buckets.