Abstract:
A method and apparatus for performing non-temporal write combining using existing cache resources is disclosed. In one embodiment, a method includes executing a first thread on a processor core, the first thread including a first block initialization store (BIS) instruction. A cache query may be performed responsive to the BIS instruction, and if the query results in a cache miss, a cache line may be installed in a cache in an unordered dirty state in which it is exclusively owned by the first thread. The first BIS instruction and one or more additional BIS instructions may write data from the first processor core into the first cache line. After a cache coherence response is received, the state of the first cache line may be changed to an ordered dirty state in which it is no longer exclusive to the first thread.
Abstract:
A method for cache coherence, including: broadcasting, by a requester cache (RC) over a partially-ordered request network (RN), a peer-to-peer (P2P) request for a cacheline to a plurality of slave caches; receiving, by the RC and over the RN while the P2P request is pending, a forwarded request for the cacheline from a gateway; receiving, by the RC and after receiving the forwarded request, a plurality of responses to the P2P request from the plurality of slave caches; setting an intra-processor state of the cacheline in the RC, wherein the intra-processor state also specifies an inter-processor state of the cacheline; and issuing, by the RC, a response to the forwarded request after setting the intra-processor state and after the P2P request is complete; and modifying, by the RC, the intra-processor state in response to issuing the response to the forwarded request.
Abstract:
A cache memory that selectively enables and disables speculative reads from system memory is disclosed. The cache memory may include a plurality of partitions, and a plurality of registers. Each register may be configured to stored data indicative of a source of returned data for previous requests directed to a corresponding partition. Circuitry may be configured to receive a request for data to a given partition. The circuitry may be further configured to read contents of a register corresponding to the given partition, and initiate a speculative read dependent upon the contents of the register.
Abstract:
A system includes a number of processors with each processor including a cache memory. The system also includes a number of directory controllers coupled to the processors. Each directory controller may be configured to administer a corresponding cache coherency directory. Each cache coherency directory may be configured to track a corresponding set of memory addresses. Each processor may be configured with information indicating the corresponding set of memory addresses tracked by each cache coherency directory. Directory redundancy operations in such a system may include identifying a failure of one of the cache coherency directories; reassigning the memory address set previously tracked by the failed cache coherency directory among the non-failed cache coherency directories; and reconfiguring each processor with information describing the reassignment of the memory address set among the non-failed cache coherency directories.
Abstract:
A method and system for processing data are disclosed. A processor, in response to executing a software program, may write an entry in a work queue. The entry may include an operation, and a location of data stored in an input buffer, and a location in an output buffer to write processed data. The processor may also generate a notification that at least one entry in the work queue is ready to be processed. The data transformation unit may assign the entry to a data transformation circuit, and retrieve the data from the input buffer using the location. The data transformation unit may also perform to the operation on the retrieved data to generate updated data, generate a completion message in response to completion of the operation, and store the updated data in an output buffer. An interface unit may relay transactions between the processor and the data transformation unit.
Abstract:
A method, system, and computer-readable medium for evicting cache lines that includes determining that a first cache line is to be evicted from a first Last Level Cache (LLC) partition of a partitioned LLC, and sending, based on the determination, a first notification to a second LLC partition of the partitioned LLC. The method may also include receiving, in response to the first notification, an available indication indicating that the second LLC partition is available as a designated victim cache partition; performing a selection of the second LLC partition as the designated victim cache partition; and evicting the first cache line to the second LLC partition based on the selection.
Abstract:
A cache memory that selectively enables and disables speculative reads from system memory is disclosed. The cache memory may include a plurality of partitions, and a plurality of registers. Each register may be configured to stored data indicative of a source of returned data for previous requests directed to a corresponding partition. Circuitry may be configured to receive a request for data to a given partition. The circuitry may be further configured to read contents of a register corresponding to the given partition, and initiate a speculative read dependent upon the contents of the register.
Abstract:
A method and apparatus for snooping caches is disclosed. In one embodiment, a system includes a number of processing nodes and a cache shared by each of the processing nodes. The cache is partitioned such that each of the processing nodes utilizes only one assigned partition. If a query by a processing node to its assigned partition of the cache results in a miss, a cache controller may determine whether to snoop other partitions in search of the requested information. The determination may be made based on history of where requested information was obtained from responsive to previous misses in that partition.
Abstract:
A method and apparatus for performing non-temporal write combining using existing cache resources is disclosed. In one embodiment, a method includes executing a first thread on a processor core, the first thread including a first block initialization store (BIS) instruction. A cache query may be performed responsive to the BIS instruction, and if the query results in a cache miss, a cache line may be installed in a cache in an unordered dirty state in which it is exclusively owned by the first thread. The first BIS instruction and one or more additional BIS instructions may write data from the first processor core into the first cache line. After a cache coherence response is received, the state of the first cache line may be changed to an ordered dirty state in which it is no longer exclusive to the first thread.
Abstract:
The systems and methods described herein may provide a flush-retire instruction for retiring “bad” cache locations (e.g., locations associated with persistent errors) to prevent their allocation for any further accesses, and a flush-unretire instruction for unretiring cache locations previously retired. These instructions may be implemented as hardware instructions of a processor. They may be executable by processes executing in a hyper-privileged state, without the need to quiesce any other processes. The flush-retire instruction may atomically flush a cache line implicated by a detected cache error and set a lock bit to disable subsequent allocation of the corresponding cache location. The flush-unretire instruction may atomically flush an identified cache line (if valid) and clear the lock bit to re-enable subsequent allocation of the cache location. Various bits in the encodings of these instructions may identify the cache location to be retired or unretired in terms of the physical cache structure.