Abstract:
Systems, methods, and apparatuses for reducing writes to the data array of a cache. A cache hierarchy includes one or more L1 caches and a L2 cache inclusive of the L2 cache(s). When a request from the L1 cache misses in the L2 cache, the L2 cache sends a fill request to memory. When the fill data returns from memory, the L2 cache delays writing the fill data to its data array. Instead, this cache line is written to the L1 cache and a clean-evict bit corresponding to the cache line is set in the L1 cache. When the L1 cache evicts this cache line, the L1 cache will write back the cache line to the L2 cache even if the cache line has not been modified.
Abstract:
Processors and methods for preventing lower level prefetch units from stalling at page boundaries. An upper level prefetch unit closest to the processor core issues a preemptive request for a translation of the next page in a given prefetch stream. The upper level prefetch unit sends the translation to the lower level prefetch units prior to the lower level prefetch units reaching the end of the current page for the given prefetch stream. When the lower level prefetch units reach the boundary of the current page, instead of stopping, these prefetch units can continue to prefetch by jumping to the next physical page number provided in the translation.
Abstract:
Techniques are disclosed relating to data synchronization barrier operations. A system includes a first processor that may receive a data barrier operation request from a second processor include in the system. Based on receiving that data barrier operation request from the second processor, the first processor may ensure that outstanding load/store operations executed by the first processor that are directed to addresses outside of an exclusion region have been completed. The first processor may respond to the second processor that the data barrier operation request is complete at the first processor, even in the case that one or more load/store operations that are directed to addresses within the exclusion region are outstanding and not complete when the first processor responds that the data barrier operation request is complete.
Abstract:
Techniques are disclosed relating to data synchronization barrier operations. A system includes a first processor that may receive a data barrier operation request from a second processor include in the system. Based on receiving that data barrier operation request from the second processor, the first processor may ensure that outstanding load/store operations executed by the first processor that are directed to addresses outside of an exclusion region have been completed. The first processor may respond to the second processor that the data barrier operation request is complete at the first processor, even in the case that one or more load/store operations that are directed to addresses within the exclusion region are outstanding and not complete when the first processor responds that the data barrier operation request is complete.
Abstract:
In an embodiment, a coprocessor may include a bypass indication which identifies execution circuitry that is not used by a given processor instruction, and thus may be bypassed. The corresponding circuitry may be disabled during execution, preventing evaluation when the output of the circuitry will not be used for the instruction. In another embodiment, the coprocessor may implement a grid of processing elements in rows and columns, where a given coprocessor instruction may specify an operation that causes up to all of the processing elements to operate on vectors of input operands to produce results. Implementations of the coprocessor may implement a portion of the processing elements. The coprocessor control circuitry may be designed to operate with the full grid or partial grid, reissuing instructions in the partial grid case to perform the requested operation. In still another embodiment, the coprocessor may be able to fuse vector mode operations.
Abstract:
A system and method for efficiently transferring address mappings and data access permissions corresponding to the address mappings. A computing system includes at least one processor and memory for storing a page table. In response to receiving a memory access operation comprising a first address, the address translation unit is configured to identify a data access permission based on a permission index corresponding to the first address, and access data stored in a memory location of the memory identified by a second address in a manner defined by the retrieved data access permission. The address translation unit is configured to access a table to identify the data access permission, and is configured to determine the permission index and the second address based on the first address. A single permission index may correspond to different permissions for different entities within the system.
Abstract:
Systems, apparatuses, and methods for retaining architected state for relatively frequent switching between sleep and active operating states are described. A processor receives an indication to transition from an active state to a sleep state. The processor stores a copy of a first subset of the architected state information in on-die storage elements capable of retaining storage after power is turned off. The processor supports programmable input/output (PIO) access of particular stored information during the sleep state. When a wakeup event is detected, circuitry within the processor is powered up again. A boot sequence and recovery of architected state from off-chip memory are not performed. Rather than fetch from a memory location pointed to by a reset base address register, the processor instead fetches an instruction from a memory location pointed to by a restored program counter of the retained subset of the architected state information.
Abstract:
In an embodiment, a processor may implement an access map-pattern match (AMPM)-based prefetch circuit for a multi-level cache system. The access patterns that are matched to the access maps may include prefetches for different cache levels. Centralizing the generation of prefetches into one prefetch circuit may provide better observability and controllability of prefetching at various levels of the cache hierarchy, in an embodiment. Prefetches at different levels may be controlled individually based on the accuracy of those prefetches, in an embodiment. Additionally, in an embodiment, access patterns that are longer that a given threshold may have the granularity of the prefetches change so that more data is prefetched and the prefetches occur farther in advance, in some embodiments.
Abstract:
A processor includes an instruction issue circuit, and high-utilization and low-utilization execution unit circuits coupled to execute instructions received from the instruction issue unit. On average, utilization of the low-utilization execution unit circuit is lower than utilization of the high-utilization execution unit circuit. The processor also includes a retention circuit coupled to a different power domain than the low-utilization execution unit circuit, and a power management circuit. The power management circuit may be configured to detect that inactivity of the low-utilization execution unit circuit satisfies a threshold inactivity level; upon detecting that the threshold inactivity level is satisfied, cause architecturally-visible state of the low-utilization execution unit circuit to be copied to the retention circuit; and subsequent to copying of the architecturally-visible state to the retention circuit, cause the low-utilization execution unit circuit to enter a power-off state, where the retention circuit retains stored data during the power-off state.
Abstract:
In an embodiment, a coprocessor may include a bypass indication which identifies execution circuitry that is not used by a given processor instruction, and thus may be bypassed. The corresponding circuitry may be disabled during execution, preventing evaluation when the output of the circuitry will not be used for the instruction. In another embodiment, the coprocessor may implement a grid of processing elements in rows and columns, where a given coprocessor instruction may specify an operation that causes up to all of the processing elements to operate on vectors of input operands to produce results. Implementations of the coprocessor may implement a portion of the processing elements. The coprocessor control circuitry may be designed to operate with the full grid or partial grid, reissuing instructions in the partial grid case to perform the requested operation. In still another embodiment, the coprocessor may be able to fuse vector mode operations.