Abstract:
Disclosed embodiments relate to remote atomic operations (RAO) in multi-socket systems. In one example, a method, performed by a cache control circuit of a requester socket, includes: receiving an RAO instruction addressing a cache line from a requester CPU core, determining a home agent in a home socket for the addressed cache line, providing a request for ownership (RFO) of the addressed cache line to the home agent, waiting for the home agent either to invalidate and retrieve a latest copy of the addressed cache line from a cache or to fetch the addressed cache line from memory, receiving an acknowledgement and the addressed cache line, executing the RAO instruction on the received cache line atomically, subsequently receiving multiple local RAO instructions to the addressed cache line from one or more requester CPU cores, and executing the multiple local RAO instructions on the received cache line independently of the home agent.
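A minimal C++ sketch of the flow described above, under stated assumptions: the types and functions (HomeAgent, CacheControlCircuit, raoAdd, grantOwnership) are illustrative names, not terms from the disclosure, and the cache line is reduced to a single integer value for clarity.

```cpp
#include <cstdint>
#include <cstdio>
#include <unordered_map>

struct HomeAgent {
    std::unordered_map<uint64_t, int64_t> memory;   // backing store, keyed by line address

    // Models "invalidate and retrieve the latest copy from a cache, or fetch
    // from memory": returns the current line value and acknowledges ownership.
    int64_t grantOwnership(uint64_t lineAddr) { return memory[lineAddr]; }
    void writeback(uint64_t lineAddr, int64_t value) { memory[lineAddr] = value; }
};

class CacheControlCircuit {
public:
    explicit CacheControlCircuit(HomeAgent& home) : home_(home) {}

    // An RAO "add" arriving from a requester CPU core on this socket.
    void raoAdd(uint64_t addr, int64_t operand) {
        const uint64_t line = addr & ~0x3Full;      // 64-byte-aligned line address
        auto it = owned_.find(line);
        if (it == owned_.end()) {
            // First RAO to this line: issue an RFO to the home agent and wait
            // for the acknowledgement plus the line data.
            it = owned_.emplace(line, home_.grantOwnership(line)).first;
        }
        // Execute atomically on the locally owned copy; subsequent local RAOs
        // to the same line take this path without contacting the home agent.
        it->second += operand;
    }

    void flush() {                                  // hand lines back on eviction
        for (auto& kv : owned_) home_.writeback(kv.first, kv.second);
        owned_.clear();
    }

private:
    HomeAgent& home_;
    std::unordered_map<uint64_t, int64_t> owned_;   // lines owned by this socket
};

int main() {
    HomeAgent home;
    CacheControlCircuit requester(home);
    requester.raoAdd(0x1000, 5);    // RFO to home agent, then atomic add
    requester.raoAdd(0x1000, 7);    // local fast path, no home-agent traffic
    requester.flush();
    std::printf("line 0x1000 = %lld\n", (long long)home.memory[0x1000]);  // prints 12
}
```

The point of the model is the second call: once the requester socket owns the line, later RAO instructions to that line execute locally without further home-agent traffic.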
Abstract:
Aspects of the embodiments are directed to systems and methods for providing and using hints in data packets to perform memory transaction optimization processes prior to receiving one or more data packets that rely on memory transactions. The systems and methods can include receiving, from a device connected to the root complex across a PCIe-compliant link, a data packet; identifying, from the received data packet, a memory transaction hint bit; determining a memory transaction from the memory transaction hint bit; and performing an optimization process based, at least in part, on the determined memory transaction.
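A brief C++ sketch of the hint-driven optimization path, purely as an assumption-laden illustration: the packet layout, the hint-bit positions, and the optimization actions shown are not from the PCIe specification or the disclosure.

```cpp
#include <cstdint>
#include <cstdio>

struct DataPacket {
    uint32_t header;     // assume one header dword carries the hypothetical hint field
    uint64_t address;
};

enum class MemoryTransaction { None, Read, Write };

// Extract the assumed hint bits from the packet header.
MemoryTransaction decodeHint(const DataPacket& pkt) {
    const bool hintValid = pkt.header & 0x1;   // assumed "hint present" bit
    if (!hintValid) return MemoryTransaction::None;
    return (pkt.header & 0x2) ? MemoryTransaction::Write
                              : MemoryTransaction::Read;
}

// The root complex performs an optimization before the dependent packets
// arrive, e.g. prefetching for an expected read or acquiring ownership
// ahead of an expected write.
void optimizeFor(MemoryTransaction txn, uint64_t addr) {
    switch (txn) {
        case MemoryTransaction::Read:
            std::printf("prefetch 0x%llx\n", (unsigned long long)addr);
            break;
        case MemoryTransaction::Write:
            std::printf("request ownership of 0x%llx\n", (unsigned long long)addr);
            break;
        case MemoryTransaction::None:
            break;
    }
}

int main() {
    DataPacket pkt{0x3, 0x2000};               // hint present + write hint
    optimizeFor(decodeHint(pkt), pkt.address);
}
```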
Abstract:
Embodiments may be generally directed to apparatuses, systems, methods, and techniques to provide multi-interconnect protocol communication. In an embodiment, an apparatus for providing multi-interconnect protocol communication may include a component comprising at least one connector operative to connect the component to at least one off-package device via a standard interconnect protocol, and logic, at least a portion of the logic comprised in hardware, the logic to determine data to be communicated via a multi-interconnect protocol, provide the data to a multi-protocol multiplexer to determine a route for the data, route the data on-package responsive to the multi-protocol multiplexer indicating a multi-interconnect on-package mode, and route the data off-package via the at least one connector responsive to the multi-protocol multiplexer indicating a multi-interconnect off-package mode. Other embodiments are described.
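A compact C++ sketch of the multiplexer decision, with hypothetical names (MultiProtocolMux, Component, route) and an arbitrary routing policy standing in for whatever criteria the logic actually applies.

```cpp
#include <cstdint>
#include <cstdio>
#include <vector>

enum class MuxMode { OnPackage, OffPackage };

struct MultiProtocolMux {
    // Decide the route for the data; keyed on a destination id here purely
    // as a stand-in for the real routing policy.
    MuxMode route(uint32_t destId) const {
        return (destId < 16) ? MuxMode::OnPackage : MuxMode::OffPackage;
    }
};

struct Component {
    MultiProtocolMux mux;

    void send(uint32_t destId, const std::vector<uint8_t>& data) {
        switch (mux.route(destId)) {
            case MuxMode::OnPackage:
                std::printf("on-package link: %zu bytes to die %u\n", data.size(), destId);
                break;
            case MuxMode::OffPackage:
                // The off-package path uses the standard interconnect protocol
                // through the component's connector.
                std::printf("connector (standard protocol): %zu bytes to device %u\n",
                            data.size(), destId);
                break;
        }
    }
};

int main() {
    Component c;
    c.send(3,  std::vector<uint8_t>(64));   // routed on-package
    c.send(42, std::vector<uint8_t>(64));   // routed off-package via the connector
}
```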
Abstract:
Disclosed embodiments relate to spatial and temporal merging of remote atomic operations. In one example, a system includes an RAO instruction queue stored in a memory and having entries grouped by destination cache line, each entry to enqueue an RAO instruction including an opcode, a destination identifier, and source data, optimization circuitry to receive an incoming RAO instruction, scan the RAO instruction queue to detect a matching enqueued RAO instruction identifying a same destination cache line as the incoming RAO instruction, the optimization circuitry further to, responsive to no matching enqueued RAO instruction being detected, enqueue the incoming RAO instruction; and, responsive to a matching enqueued RAO instruction being detected, determine whether the incoming and matching RAO instructions have a same opcode and target non-overlapping cache line elements, and, if so, spatially combine the incoming and matching RAO instructions by enqueuing both RAO instructions in a same group of cache line queue entries at different offsets.
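A small C++ sketch of the spatial-combining check, assuming a simplified queue (RaoOptimizationQueue, enqueue) in which a group corresponds to one destination cache line and an "element" is identified by its offset; these names and the simplifications are illustrative, not from the disclosure.

```cpp
#include <cstdint>
#include <cstdio>
#include <map>
#include <vector>

struct RaoInstruction {
    uint16_t opcode;       // e.g. ADD, AND, ...
    uint64_t destLine;     // destination cache line
    uint8_t  offset;       // element offset within the line
    uint64_t sourceData;
};

class RaoOptimizationQueue {
public:
    // Returns true if the incoming instruction was spatially combined into
    // an existing group rather than starting a new one.
    bool enqueue(const RaoInstruction& insn) {
        auto& group = groups_[insn.destLine];    // scan by destination cache line
        bool combinable = !group.empty();
        for (const auto& queued : group) {
            // Spatial combining requires the same opcode and non-overlapping
            // cache-line elements (i.e. different offsets).
            if (queued.opcode != insn.opcode || queued.offset == insn.offset)
                combinable = false;
        }
        group.push_back(insn);                   // starts a group or joins one
        return combinable;
    }

private:
    // Entries grouped by destination cache line.
    std::map<uint64_t, std::vector<RaoInstruction>> groups_;
};

int main() {
    RaoOptimizationQueue q;
    std::printf("%d\n", q.enqueue({1, 0x1000, 0, 10}));  // 0: starts a new group
    std::printf("%d\n", q.enqueue({1, 0x1000, 8, 20}));  // 1: same opcode, different offset, combined
}
```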
Abstract:
Apparatus and methods implementing a hardware queue management device for reducing inter-core data transfer overhead by offloading request management and data coherency tasks from the CPU cores. The apparatus includes multi-core processors, a shared L3 or last-level cache ("LLC"), and a hardware queue management device to receive, store, and process inter-core data transfer requests. The hardware queue management device further comprises a resource management system to control the rate at which the cores may submit requests, to reduce core stalls and dropped requests. Additionally, software instructions are introduced to optimize communication between the cores and the queue management device.
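One way to picture the rate control is a credit scheme, sketched below in C++ under that assumption; the credit mechanism and the names (QueueManagementDevice, submit, complete) are illustrative choices, not details taken from the disclosure.

```cpp
#include <cstdint>
#include <cstdio>
#include <deque>

struct TransferRequest { uint32_t producerCore, consumerCore; uint64_t payload; };

class QueueManagementDevice {
public:
    explicit QueueManagementDevice(unsigned creditsPerCore) : creditsPerCore_(creditsPerCore) {}

    // A core may only submit while it holds credits; otherwise the request is
    // rejected immediately instead of stalling the core or being silently dropped.
    bool submit(const TransferRequest& req, unsigned& coreCredits) {
        if (coreCredits == 0) return false;     // rate-limited
        --coreCredits;
        queue_.push_back(req);
        return true;
    }

    // The consumer side drains the queue; completing a request returns a
    // credit to the producing core.
    bool complete(unsigned& producerCredits) {
        if (queue_.empty()) return false;
        queue_.pop_front();
        ++producerCredits;
        return true;
    }

    unsigned initialCredits() const { return creditsPerCore_; }

private:
    unsigned creditsPerCore_;
    std::deque<TransferRequest> queue_;
};

int main() {
    QueueManagementDevice qmd(2);
    unsigned core0Credits = qmd.initialCredits();
    std::printf("%d\n", qmd.submit({0, 1, 42}, core0Credits));  // 1: accepted
    std::printf("%d\n", qmd.submit({0, 1, 43}, core0Credits));  // 1: accepted
    std::printf("%d\n", qmd.submit({0, 1, 44}, core0Credits));  // 0: out of credits
    qmd.complete(core0Credits);                                  // credit returned
    std::printf("%d\n", qmd.submit({0, 1, 44}, core0Credits));  // 1: accepted again
}
```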
Abstract:
A processor includes a cache-side address monitor unit corresponding to a first cache portion of a distributed cache that has a total number of cache-side address monitor storage locations less than a total number of logical processors of the processor. Each cache-side address monitor storage location is to store an address to be monitored. A core-side address monitor unit corresponds to a first core and has a same number of core-side address monitor storage locations as a number of logical processors of the first core. Each core-side address monitor storage location is to store an address and a monitor state for a different corresponding logical processor of the first core. A cache-side address monitor storage overflow unit corresponds to the first cache portion, and is to enforce an address monitor storage overflow policy when no unused cache-side address monitor storage location is available to store an address to be monitored.
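A simplified C++ model of the two monitor structures and the overflow case; the particular overflow policy shown (falling back to conservatively reporting every address as monitored) is one possible choice assumed for illustration, and all names are hypothetical.

```cpp
#include <cstdint>
#include <optional>
#include <vector>

struct CoreSideMonitor {
    // One slot per logical processor of the core: an address plus monitor state.
    struct Slot { uint64_t addr = 0; bool armed = false; };
    std::vector<Slot> slots;
    explicit CoreSideMonitor(unsigned logicalProcessors) : slots(logicalProcessors) {}
    void arm(unsigned lp, uint64_t addr) { slots[lp] = {addr, true}; }
};

struct CacheSideMonitor {
    // Fewer slots than logical processors exist in the whole processor.
    std::vector<std::optional<uint64_t>> slots;
    bool overflowed = false;                       // overflow-policy state
    explicit CacheSideMonitor(unsigned numSlots) : slots(numSlots) {}

    void monitor(uint64_t addr) {
        for (auto& s : slots)
            if (!s) { s = addr; return; }          // unused slot available
        overflowed = true;                         // enforce the overflow policy
    }

    bool hits(uint64_t addr) const {
        if (overflowed) return true;               // conservative: report all addresses
        for (const auto& s : slots)
            if (s && *s == addr) return true;
        return false;
    }
};

int main() {
    CacheSideMonitor cacheSide(2);                 // fewer slots than logical processors
    CoreSideMonitor coreSide(4);                   // one slot per logical processor
    coreSide.arm(0, 0x1000);
    cacheSide.monitor(0x1000);
    cacheSide.monitor(0x2000);
    cacheSide.monitor(0x3000);                     // no unused slot: overflow policy applies
    return cacheSide.hits(0x3000) ? 0 : 1;         // returns 0: overflow makes this a hit
}
```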
Abstract:
A microelectronic assembly is provided comprising: a first plurality of integrated circuit (IC) dies arranged in an array of rows and columns in a first layer; and a second plurality of IC dies in a second layer not coplanar with the first layer. A first IC die in the first plurality is differently sized than surrounding IC dies in the first plurality, and a second IC die in the second plurality coupled to the first IC die comprises at least one of: repeater circuitry and a fanout structure in an electrical pathway coupling the first IC die with an adjacent IC die in the first plurality.