Abstract:
A technique for processing an instruction sequence that includes a barrier instruction, a load instruction preceding the barrier instruction, and a subsequent memory access instruction following the barrier instruction includes determining, by a processor core, that the load instruction is resolved based upon receipt by the processor core of the earlier of a good combined response for a read operation corresponding to the load instruction and data for the load instruction. The technique also includes, if execution of the subsequent memory access instruction is not initiated prior to completion of the barrier instruction, initiating, by the processor core, in response to determining the barrier instruction completed, execution of the subsequent memory access instruction. The technique further includes, if execution of the subsequent memory access instruction is initiated prior to completion of the barrier instruction, discontinuing, by the processor core, in response to determining the barrier instruction completed, tracking of the subsequent memory access instruction with respect to invalidation.
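The resolution rule and the two post-barrier cases can be illustrated with a small software model. This is only a sketch: the names LoadState, resolved_at, and on_barrier_complete, and the use of integer cycle counts, are assumptions made for illustration and do not come from the abstract.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LoadState:
    """Arrival cycles of the two events for one load; None means not yet seen."""
    good_cresp_at: Optional[int] = None  # good combined response for the read
    data_at: Optional[int] = None        # data for the load

    def resolved_at(self) -> Optional[int]:
        """The load counts as resolved at the earlier of the two events."""
        seen = [t for t in (self.good_cresp_at, self.data_at) if t is not None]
        return min(seen) if seen else None


def on_barrier_complete(started_early: bool) -> str:
    """Action taken for the post-barrier memory access once the barrier completes."""
    if started_early:
        # Executed speculatively before the barrier completed:
        # its invalidation tracking can now be dropped.
        return "stop_invalidation_tracking"
    # Not yet started: execution may now begin.
    return "initiate_execution"
```

For example, a load whose data arrives at cycle 9 and whose good combined response arrives at cycle 12 is treated as resolved at cycle 9.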
Abstract:
A multiprocessor data processing system includes multiple vertical cache hierarchies supporting a plurality of processor cores, a system memory, and an interconnect fabric coupled to the system memory and the multiple vertical cache hierarchies. Based on a request of a requesting processor core among the plurality of processor cores, a master in the multiprocessor data processing system issues, via the interconnect fabric, a read-type memory access request. The master receives via the interconnect fabric at least one beat of conditional data issued speculatively on the interconnect fabric by a controller of the system memory prior to receipt by the controller of a systemwide coherence response for the read-type memory access request. The master forwards the at least one beat of conditional data to the requesting processor core.
Abstract:
A data processing system includes a plurality of processor cores each supported by a respective one of a plurality of vertical cache hierarchies. Based on receiving on a system fabric a cache injection request requesting injection of data into a cache line identified by a target real address, the data is written into a cache in a first vertical cache hierarchy among the plurality of vertical cache hierarchies. Based on a value in a field of the cache injection request, a distribute field is set in a directory entry of the first vertical cache hierarchy. Upon eviction of the cache line from the first vertical cache hierarchy, a determination is made whether the distribute field is set. Based on determining the distribute field is set, a lateral castout of the cache line from the first vertical cache hierarchy to a second vertical cache hierarchy is performed.
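The injection and eviction flow can be sketched as a toy software model. The class and method names below are hypothetical. The abstract does not say what happens when the distribute field is not set, nor whether the field survives in the receiving hierarchy, so the "normal_eviction" result and the cleared field in the peer are assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class DirectoryEntry:
    data: bytes
    distribute: bool = False  # the distribute field of the directory entry


@dataclass
class CacheHierarchy:
    name: str
    lines: dict = field(default_factory=dict)  # target real address -> DirectoryEntry

    def inject(self, real_addr: int, data: bytes, distribute_hint: int) -> None:
        """Write injected data and set the distribute field from the request field."""
        self.lines[real_addr] = DirectoryEntry(data, distribute=bool(distribute_hint))

    def evict(self, real_addr: int, peer: "CacheHierarchy") -> str:
        """On eviction, check the distribute field and castout laterally if it is set."""
        entry = self.lines.pop(real_addr)
        if entry.distribute:
            # Assumption: the peer installs the line with the field cleared.
            peer.lines[real_addr] = DirectoryEntry(entry.data)
            return "lateral_castout"
        # Assumption: otherwise the line leaves this hierarchy in the usual way.
        return "normal_eviction"
```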
Abstract:
A multiprocessor data processing system includes a processor core having a translation structure for buffering a plurality of translation entries. The processor core receives a sequence of a plurality of translation invalidation requests. In response to receipt of each of the plurality of translation invalidation requests, the processor core determines that each of the plurality of translation invalidation requests indicates that it does not require draining of memory referent instructions for which address translation has been performed by reference to a respective one of a plurality of translation entries to be invalidated. Based on the determination, the processor core invalidates the plurality of translation entries in the translation structure without regard to draining from the processor core of memory access requests for which address translation was performed by reference to the plurality of translation entries.
Abstract:
A data processing system includes a processor core having a shared store-through upper level cache and a store-in lower level cache. The processor core executes a plurality of simultaneous hardware threads of execution including at least a first thread and a second thread, and the shared store-through upper level cache stores a first cache line accessible to both the first thread and the second thread. The processor core executes in the first thread a store instruction that generates a store request specifying a target address of a storage location corresponding to the first cache line. Based on the target address hitting in the shared store-through upper level cache, the first cache line is temporarily marked, in the shared store-through upper level cache, as private to the first thread, such that any memory access request by the second thread targeting the storage location will miss in the shared store-through upper level cache.
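A minimal sketch of the private-marking behavior follows, modeling the shared upper level cache as a dictionary. All names are hypothetical. The abstract says only that the mark is temporary, so the release method and its trigger are assumptions, as is the choice to let the storing thread keep hitting on its own private line.

```python
class SharedL1:
    """Toy model of a store-through upper level cache shared by hardware threads."""

    def __init__(self):
        # line address -> id of the thread the line is private to, or None if shared
        self.private_to = {}

    def fill(self, line_addr):
        """Install a line that is accessible to every thread."""
        self.private_to[line_addr] = None

    def store(self, thread_id, line_addr):
        """On a store hit, mark the line as private to the storing thread."""
        if line_addr in self.private_to:
            self.private_to[line_addr] = thread_id

    def hits(self, thread_id, line_addr):
        """A line marked private to another thread looks like a miss."""
        if line_addr not in self.private_to:
            return False
        owner = self.private_to[line_addr]
        return owner is None or owner == thread_id

    def release(self, line_addr):
        """Hypothetical: clear the temporary mark so the line is shared again."""
        if line_addr in self.private_to:
            self.private_to[line_addr] = None
```

In this model, after thread 0 stores to a resident line, a lookup by thread 1 misses until release is called for that line.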
Abstract:
A processing unit for a data processing system includes a cache memory having reservation logic and a processor core coupled to the cache memory. The processor core includes an execution unit that executes instructions in a plurality of concurrent hardware threads of execution including at least first and second hardware threads. The instructions include, within the first hardware thread, a first load-reserve instruction that identifies a target address for which a reservation is requested. The processor core additionally includes a load unit that records the target address of the first load-reserve instruction and that, responsive to detecting, in the second hardware thread, a second load-reserve instruction identifying the target address recorded by the load unit, blocks the second load-reserve instruction from establishing a reservation for the target address in the reservation logic.
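The blocking behavior of the load unit can be sketched as follows. The LoadUnit class, its recorded map, and the representation of the reservation logic as a set of (thread, address) pairs are all assumptions made for illustration. The abstract does not say when a recorded address is cleared, so this sketch never clears it.

```python
class LoadUnit:
    """Toy model: records load-reserve targets and blocks competing reservations."""

    def __init__(self):
        self.recorded = {}  # target address -> id of the thread that recorded it

    def load_reserve(self, thread_id, target_addr, reservations):
        """Return True if a reservation was established, False if it was blocked.

        `reservations` stands in for the reservation logic of the cache memory.
        """
        owner = self.recorded.get(target_addr)
        if owner is not None and owner != thread_id:
            # Another thread already recorded this target address: block.
            return False
        self.recorded[target_addr] = thread_id
        reservations.add((thread_id, target_addr))
        return True
```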
Abstract:
Ensuring forward progress for nested translations in a memory management unit (MMU) including receiving a plurality of nested translation requests, wherein each of the plurality of nested translation requests requires at least one congruence class lock; detecting, using a congruence class scoreboard, a collision of the plurality of nested translation requests based on the required congruence class locks; quiescing, in response to detecting the collision of the plurality of nested translation requests, a translation pipeline in the MMU including switching operation of the translation pipeline from a multi-thread mode to a single-thread mode and marking a first subset of the plurality of nested translation requests as high-priority nested translation requests; and servicing the high-priority nested translation requests through the translation pipeline in the single-thread mode.
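Collision detection on a congruence class scoreboard, followed by the switch to single-thread mode, might be modeled as below. The function names and the dictionary-based scoreboard are hypothetical. The abstract does not say how the high-priority subset is chosen, so marking every colliding request is an assumption.

```python
def detect_collision(requests):
    """Return the ids of requests that need a congruence class lock another request also needs.

    `requests` maps a request id to the set of congruence classes it must lock.
    """
    scoreboard = {}  # congruence class -> ids of the requests that need its lock
    for req_id, classes in requests.items():
        for cc in classes:
            scoreboard.setdefault(cc, []).append(req_id)
    return {rid for ids in scoreboard.values() if len(ids) > 1 for rid in ids}


def quiesce(requests):
    """Return (pipeline mode, high-priority request ids) for a batch of requests."""
    colliding = detect_collision(requests)
    if not colliding:
        return "multi-thread", []
    # Assumption: every colliding request is marked high priority.
    return "single-thread", sorted(colliding)
```

With requests a and b both needing the lock for class 2 and request c needing only class 7, the sketch switches to single-thread mode and marks a and b as high priority.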
Abstract:
Reducing translation latency within a memory management unit (MMU) using external caching structures including requesting, by the MMU on a node, page table entry (PTE) data and coherent ownership of the PTE data from a page table in memory; receiving, by the MMU, the PTE data, a source flag, and an indication that the MMU has coherent ownership of the PTE data, wherein the source flag identifies a source location of the PTE data; performing a lateral cast out to a local high-level cache on the node in response to determining that the source flag indicates that the source location of the PTE data is external to the node; and directing at least one subsequent request for the PTE data to the local high-level cache.
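The source-flag decision can be sketched in a few lines. The Mmu class, the integer node ids used as source flags, and the dictionary standing in for the local high-level cache are assumptions. Coherent ownership is taken as already granted and is not modeled.

```python
class Mmu:
    """Toy model of an MMU that caches remotely sourced PTE data on its own node."""

    def __init__(self, node_id):
        self.node_id = node_id
        self.local_cache = {}  # PTE address -> PTE data held in the local high-level cache

    def on_pte_response(self, pte_addr, pte_data, source_node):
        """Perform a lateral cast out only when the data came from outside the node."""
        if source_node != self.node_id:
            self.local_cache[pte_addr] = pte_data
            return True
        return False

    def lookup_target(self, pte_addr):
        """Where a subsequent request for this PTE is directed."""
        return "local_cache" if pte_addr in self.local_cache else "page_table"
```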
Abstract:
In a data processing system implementing a weak memory model, a lower level cache receives, from a processor core, a plurality of copy-type requests and a plurality of paste-type requests that together indicate a memory move to be performed. The lower level cache also receives, from the processor core, a barrier request that requests enforcement of ordering of memory access requests prior to the barrier request with respect to memory access requests after the barrier request. In response to the barrier request, the lower level cache enforces a barrier indicated by the barrier request with respect to a final paste-type request ending the memory move but not with respect to other copy-type requests and paste-type requests in the memory move.
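The selective enforcement can be illustrated by a helper that picks out which requests of a memory move the barrier must wait for. Representing each request as a (kind, is_final) tuple and the name barrier_waits_on are assumptions made for this sketch.

```python
def barrier_waits_on(move_requests):
    """Return the requests of one memory move that the barrier is enforced against.

    Each request is a (kind, is_final) tuple, where kind is "copy" or "paste".
    Only the final paste-type request ending the move is ordered by the barrier.
    """
    return [req for req in move_requests if req[0] == "paste" and req[1]]
```

For a move made of two copy and paste pairs, only the last paste is returned, so the barrier need not wait on the earlier copy-type and paste-type requests.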