Abstract:
Technologies for dynamic home tile mapping are described. An address request can be received from a processing core, the processing core being associated with a home tile table, the home tile table including respective mappings of one or more directory addresses to one or more home tiles. A buffer can be scanned to identify a presence of the address within the buffer. Based on an identification of the presence of the address within the buffer, a home tile identifier corresponding to the address can be provided from the buffer.
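The following is a minimal software sketch of the lookup flow this abstract describes, not the patented hardware: a small buffer of recent directory-address-to-home-tile mappings is scanned first, and the per-core home tile table is consulted on a miss. The type names, buffer policy, and example addresses are illustrative assumptions.

```cpp
// Sketch: check a buffer of recent mappings before the per-core home tile table.
#include <cstdint>
#include <optional>
#include <unordered_map>
#include <iostream>

using Address = std::uint64_t;
using HomeTileId = std::uint16_t;

struct HomeTileLookup {
    std::unordered_map<Address, HomeTileId> home_tile_table; // per-core table
    std::unordered_map<Address, HomeTileId> buffer;          // recently used mappings

    // Handle an address request from the processing core.
    std::optional<HomeTileId> request(Address addr) {
        // Scan the buffer to identify a presence of the address.
        if (auto hit = buffer.find(addr); hit != buffer.end())
            return hit->second;           // provide the home tile id from the buffer
        // Miss: consult the home tile table and fill the buffer.
        if (auto entry = home_tile_table.find(addr); entry != home_tile_table.end()) {
            buffer[addr] = entry->second;
            return entry->second;
        }
        return std::nullopt;              // no mapping known
    }
};

int main() {
    HomeTileLookup lookup;
    lookup.home_tile_table[0x1000] = 3;   // directory address 0x1000 maps to tile 3
    if (auto tile = lookup.request(0x1000))
        std::cout << "home tile: " << *tile << '\n';              // fills the buffer
    if (auto tile = lookup.request(0x1000))
        std::cout << "home tile (buffer hit): " << *tile << '\n'; // served from the buffer
}
```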
Abstract:
An apparatus and method for determining whether to execute an atomic operation locally or remotely. For example, one embodiment of a processor comprises: a decoder to decode an atomic operation on a local core; prediction logic on the local core to estimate a cost associated with execution of the atomic operation on the local core and a cost associated with execution of the atomic operation on a remote core; the remote core to execute the atomic operation remotely if the prediction logic determines that the cost for execution on the local core is relatively greater than the cost for execution on the remote core; and the local core to execute the atomic operation locally if the prediction logic determines that the cost for execution on the local core is relatively less than the cost for execution on the remote core.
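As a rough illustration of the decision this abstract describes, the sketch below compares an estimated local cost against an estimated remote cost and picks the cheaper placement. The cost model, cycle counts, and operand fields are invented for illustration and are not the patent's prediction logic.

```cpp
// Sketch: prediction logic choosing local versus remote execution of an atomic op.
#include <cstdint>
#include <iostream>

struct AtomicOp {
    std::uint64_t address;
    bool line_cached_locally;   // is the cache line already in the local core's cache?
    int remote_sharers;         // cores currently contending for the line
};

// Rough cycle estimate for executing the atomic op on the local core.
int estimate_local_cost(const AtomicOp& op) {
    // Cheap if the line is already here, expensive if a contended line must migrate.
    return op.line_cached_locally ? 20 : 80 + 40 * op.remote_sharers;
}

// Rough cycle estimate for shipping the atomic op to a remote core.
int estimate_remote_cost(const AtomicOp& op) {
    // Fixed message round-trip, but the contended line stays put.
    return 60 + 5 * op.remote_sharers;
}

bool execute_remotely(const AtomicOp& op) {
    return estimate_local_cost(op) > estimate_remote_cost(op);
}

int main() {
    AtomicOp hot_counter{0x2000, /*line_cached_locally=*/false, /*remote_sharers=*/8};
    AtomicOp private_flag{0x3000, /*line_cached_locally=*/true, /*remote_sharers=*/0};
    std::cout << "hot counter  -> " << (execute_remotely(hot_counter) ? "remote" : "local") << '\n';
    std::cout << "private flag -> " << (execute_remotely(private_flag) ? "remote" : "local") << '\n';
}
```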
Abstract:
A processing device comprises a processing device cache and a cache controller. The cache controller initiates a cache line eviction process and determines an object liveness value associated with a cache line in the processing device cache. The cache controller applies the object liveness value to a cache line eviction policy and evicts the cache line from the processing device cache based on the object liveness value and the cache line eviction policy.
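A minimal sketch of how a liveness value could feed an eviction policy: among the candidate lines in a set, prefer evicting the one whose associated object is least likely to be live, falling back to recency on ties. The liveness scale and tie-breaking rule are assumptions, not the patent's policy.

```cpp
// Sketch: victim selection that folds an object liveness value into the eviction policy.
#include <cstdint>
#include <vector>
#include <algorithm>
#include <iostream>

struct CacheLine {
    std::uint64_t tag;
    std::uint32_t last_use;    // smaller = older (classic LRU input)
    float object_liveness;     // 0.0 = object likely dead, 1.0 = likely live
};

// Pick the line with the lowest liveness; fall back to LRU on ties.
std::size_t pick_victim(const std::vector<CacheLine>& set) {
    return std::distance(set.begin(),
        std::min_element(set.begin(), set.end(),
            [](const CacheLine& a, const CacheLine& b) {
                if (a.object_liveness != b.object_liveness)
                    return a.object_liveness < b.object_liveness;
                return a.last_use < b.last_use;
            }));
}

int main() {
    std::vector<CacheLine> set = {
        {0xA0, 100, 0.9f},   // recently used, object still live
        {0xB0,  40, 0.1f},   // object has likely been freed
        {0xC0,  10, 0.9f},   // oldest, but object is live
    };
    std::cout << "evict line with tag 0x" << std::hex << set[pick_victim(set)].tag << '\n';
}
```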
Abstract:
In an embodiment, a processor includes a vector execution unit having a plurality of lanes to execute operations on vector operands, a performance monitor coupled to the vector execution unit to maintain information regarding an activity level of the lanes, and a control logic coupled to the performance monitor to control power consumption of the vector execution unit based at least in part on the activity level of at least some of the lanes. Other embodiments are described and claimed.
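The sketch below models the monitor-and-control loop this abstract outlines: per-lane activity is accumulated over a sampling window and lanes below a utilization threshold are gated for the next window. The lane count, window length, and threshold are illustrative assumptions.

```cpp
// Sketch: performance monitor tracks per-lane activity; control logic gates idle lanes.
#include <array>
#include <bitset>
#include <cstdint>
#include <iostream>

constexpr int kLanes = 16;

struct LaneActivityMonitor {
    std::array<std::uint32_t, kLanes> active_cycles{};  // cycles each lane did work
    std::uint32_t window_cycles = 0;

    void record(const std::bitset<kLanes>& lanes_active) {
        ++window_cycles;
        for (int i = 0; i < kLanes; ++i)
            if (lanes_active[i]) ++active_cycles[i];
    }
};

// Decide which lanes to keep powered for the next window.
std::bitset<kLanes> lanes_to_power(const LaneActivityMonitor& mon, double threshold) {
    std::bitset<kLanes> powered;
    for (int i = 0; i < kLanes; ++i) {
        double utilization = mon.window_cycles
            ? static_cast<double>(mon.active_cycles[i]) / mon.window_cycles
            : 1.0;
        powered[i] = utilization >= threshold;   // gate mostly-idle lanes
    }
    return powered;
}

int main() {
    LaneActivityMonitor mon;
    // Simulate a window in which only the low four lanes are busy.
    for (int c = 0; c < 1000; ++c) mon.record(std::bitset<kLanes>{0x000F});
    std::cout << "powered lanes: " << lanes_to_power(mon, 0.10) << '\n';
}
```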
Abstract:
A processor includes a core, a hardware prefetcher, and a prefetcher control module. The hardware prefetcher includes logic to make speculative prefetch requests, through a memory subsystem, for elements for execution by the core, and logic to store prefetched elements in a cache. The prefetcher control module includes logic to selectively suppress, based on a hardware-prefetch suppression instruction executed by the core, a speculative prefetch request to be made by the hardware prefetcher.
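A simple software model of the control path this abstract describes: the suppression instruction is represented as a function call that opens a suppression window over an address range, and the prefetcher consults that state before issuing a speculative request. The range-based encoding and all values are assumptions for illustration.

```cpp
// Sketch: prefetcher control module that selectively suppresses speculative prefetches.
#include <cstdint>
#include <vector>
#include <iostream>

struct PrefetcherControl {
    std::uint64_t range_begin = 0, range_end = 0;   // suppressed address range

    // Modeled stand-in for the hardware-prefetch-suppression instruction.
    void suppress(std::uint64_t begin, std::uint64_t end) {
        range_begin = begin;
        range_end = end;
    }

    // Consulted by the hardware prefetcher before issuing a speculative request.
    bool allow_prefetch(std::uint64_t addr) const {
        return addr < range_begin || addr >= range_end;
    }
};

int main() {
    PrefetcherControl ctrl;
    ctrl.suppress(0x10000, 0x20000);  // core asks to suppress prefetches to this range

    std::vector<std::uint64_t> speculative = {0x0F000, 0x18000, 0x28000};
    for (auto addr : speculative)
        std::cout << std::hex << "0x" << addr << ": "
                  << (ctrl.allow_prefetch(addr) ? "prefetch" : "suppressed") << '\n';
}
```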
Abstract:
Instructions and logic provide pushing buffer copy and store functionality. Some embodiments include a first hardware thread or processing core, a second hardware thread or processing core, and a cache to store cache coherent data in a cache line for a shared memory address accessible by the second hardware thread or processing core. Responsive to decoding an instruction specifying a source data operand, said shared memory address as a destination operand, and one or more owners of said shared memory address, one or more execution units copy data from the source data operand to the cache coherent data in the cache line for said shared memory address, accessible by said second hardware thread or processing core in the cache, when said one or more owners include said second hardware thread or processing core.
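To make the push semantics concrete, here is a sketch in which a store to a shared address is placed directly into the cache of the core named as the line's owner, so the consumer finds the data locally. The two-core cache model, the owner bitmask, and the example values are illustrative assumptions, not the patented microarchitecture.

```cpp
// Sketch: "push store" that copies data into the cache of the owning core.
#include <cstdint>
#include <unordered_map>
#include <iostream>

using Address = std::uint64_t;

struct Cache {
    std::unordered_map<Address, std::uint64_t> lines;  // address -> cached data
};

struct System {
    Cache core_caches[2];

    // Copy the source operand into the cache line for the shared address,
    // in the cache of every core listed as an owner.
    void push_store(std::uint64_t src, Address shared_addr, std::uint32_t owner_mask) {
        for (int core = 0; core < 2; ++core)
            if (owner_mask & (1u << core))
                core_caches[core].lines[shared_addr] = src;  // data lands in the owner's cache
    }
};

int main() {
    System sys;
    // Core 0 produces a value; core 1 owns the shared line, so push it there.
    sys.push_store(/*src=*/42, /*shared_addr=*/0x4000, /*owner_mask=*/0b10);
    std::cout << "core 1 sees: " << sys.core_caches[1].lines[0x4000] << '\n';
}
```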
Abstract:
In an embodiment, a processor includes a sparse access buffer having a plurality of entries, each to store, for a memory access instruction to a particular address, address information and count information; and a memory controller to issue read requests to a memory, the memory controller including a locality controller to receive a memory access instruction having a no-locality hint and to override the no-locality hint based at least in part on the count information stored in an entry of the sparse access buffer. Other embodiments are described and claimed.
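A minimal sketch of the override decision, assuming the sparse access buffer simply counts repeated accesses to the same address and the hint is ignored once the count crosses a reuse threshold; the threshold and buffer organization are assumptions.

```cpp
// Sketch: locality controller overriding a no-locality hint using per-address counts.
#include <cstdint>
#include <unordered_map>
#include <iostream>

struct SparseAccessBuffer {
    std::unordered_map<std::uint64_t, std::uint32_t> entries;  // address -> access count

    std::uint32_t record(std::uint64_t addr) { return ++entries[addr]; }
};

struct LocalityController {
    SparseAccessBuffer sab;
    std::uint32_t reuse_threshold = 3;

    // True if the request should be cached normally despite its no-locality hint.
    bool override_no_locality(std::uint64_t addr) {
        return sab.record(addr) >= reuse_threshold;
    }
};

int main() {
    LocalityController ctrl;
    for (int i = 0; i < 4; ++i) {
        bool cached = ctrl.override_no_locality(0x8000);
        std::cout << "access " << i + 1 << ": "
                  << (cached ? "hint overridden, cache it" : "honor no-locality hint") << '\n';
    }
}
```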
Abstract:
Methods and apparatus implementing Hardware/Software co-optimization to improve performance and energy for inter-VM communication for NFVs and other producer-consumer workloads. The apparatus includes multi-core processors with multi-level cache hierarchies including L1 and L2 caches for each core and a shared last-level cache (LLC). One or more machine-level instructions are provided for proactively demoting cachelines from lower cache levels to higher cache levels, including demoting cachelines from L1/L2 caches to an LLC. Techniques are also provided for implementing hardware/software co-optimization in multi-socket NUMA architecture systems, wherein cachelines may be selectively demoted and pushed to an LLC in a remote socket. In addition, techniques are disclosed for implementing early snooping in multi-socket systems to reduce latency when accessing cachelines on remote sockets.
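A producer-side sketch of the demotion idea, assuming a compiler and CPU with CLDEMOTE support (e.g. GCC/Clang built with -mcldemote); where the `__CLDEMOTE__` macro is not defined the demote step is a no-op. The buffer layout and demotion granularity are illustrative, and this is only one way such an instruction might be used, not the patent's software.

```cpp
// Sketch: producer writes a buffer, then demotes the touched cachelines toward the LLC
// so a consumer on another core finds them there instead of snooping L1/L2.
#include <cstddef>
#include <cstdint>
#include <immintrin.h>

constexpr std::size_t kLineBytes = 64;

// Demote one cacheline if the CLDEMOTE instruction is available at build time.
static inline void demote_line(const void* p) {
#ifdef __CLDEMOTE__
    _cldemote(p);          // hint: move this line from L1/L2 toward the LLC
#else
    (void)p;               // no hardware support: do nothing
#endif
}

void produce(std::uint8_t* buf, std::size_t bytes) {
    for (std::size_t i = 0; i < bytes; ++i)
        buf[i] = static_cast<std::uint8_t>(i);
    for (std::size_t off = 0; off < bytes; off += kLineBytes)
        demote_line(buf + off);
}

int main() {
    alignas(kLineBytes) static std::uint8_t buffer[4 * kLineBytes];
    produce(buffer, sizeof(buffer));
    return buffer[0];      // keep the buffer observable
}
```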
Abstract:
In one embodiment, a processor may include a vector unit to perform operations on multiple data elements responsive to a single instruction, and a control unit coupled to the vector unit to provide the data elements to the vector unit, where the control unit is to enable an atomic vector operation to be performed on at least some of the data elements responsive to a first vector instruction to be executed under a first mask and a second vector instruction to be executed under a second mask. Other embodiments are described and claimed.
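The following scalar emulation is one reading of the two-instruction, two-mask flow: a first instruction operates on the elements selected by the first mask, and a second instruction commits the update only for elements whose second-mask bit indicates eligibility. The element count, the "increment" operation, and the success model are assumptions for illustration.

```cpp
// Sketch: scalar emulation of an atomic vector update split across two masked instructions.
#include <array>
#include <bitset>
#include <cstdint>
#include <iostream>

constexpr int kVL = 8;

int main() {
    std::array<std::uint32_t, kVL> memory = {10, 20, 30, 40, 50, 60, 70, 80};
    std::array<std::uint32_t, kVL> regs{};

    std::bitset<kVL> mask1("10110110");   // elements the first instruction touches
    std::bitset<kVL> mask2;               // set per element when the update may commit

    // First vector instruction (under mask1): gather the selected elements and
    // mark them eligible for the update (modeled as always succeeding).
    for (int i = 0; i < kVL; ++i) {
        if (mask1[i]) {
            regs[i]  = memory[i];
            mask2[i] = true;
        }
    }

    // Second vector instruction (under mask2): write back the modified elements,
    // completing the atomic vector operation for those lanes.
    for (int i = 0; i < kVL; ++i)
        if (mask2[i]) memory[i] = regs[i] + 1;

    for (auto v : memory) std::cout << v << ' ';
    std::cout << '\n';
}
```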
Abstract:
A scatter/gather technique optimizes unstructured streaming memory accesses, providing off-chip bandwidth efficiency by accessing only useful data at a fine granularity, and off-loading memory access overhead by supporting address calculation, data shuffling, and format conversion.
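A software sketch of the gather/scatter pattern the abstract targets: only the useful elements named by an index list are fetched into a dense buffer, the computation runs on the packed data, and the results are written back to their sparse locations. The element type, index source, and computation are illustrative; the patented engine performs this in hardware.

```cpp
// Sketch: gather sparse elements into a dense buffer, compute, scatter results back.
#include <cstddef>
#include <vector>
#include <iostream>

// Gather: pack only the useful elements into a contiguous, cache-friendly buffer.
std::vector<float> gather(const std::vector<float>& mem, const std::vector<std::size_t>& idx) {
    std::vector<float> packed;
    packed.reserve(idx.size());
    for (std::size_t i : idx) packed.push_back(mem[i]);
    return packed;
}

// Scatter: write the dense results back to their original sparse locations.
void scatter(std::vector<float>& mem, const std::vector<std::size_t>& idx,
             const std::vector<float>& packed) {
    for (std::size_t k = 0; k < idx.size(); ++k) mem[idx[k]] = packed[k];
}

int main() {
    std::vector<float> memory(16, 0.0f);
    for (std::size_t i = 0; i < memory.size(); ++i) memory[i] = static_cast<float>(i);

    std::vector<std::size_t> indices = {1, 5, 9, 13};       // only these elements are useful
    std::vector<float> dense = gather(memory, indices);     // fine-grained fetch
    for (auto& v : dense) v *= 2.0f;                         // compute on packed data
    scatter(memory, indices, dense);                         // write results back

    for (auto v : memory) std::cout << v << ' ';
    std::cout << '\n';
}
```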