Abstract:
In response to a need to initiate one or more global operations, a bus master within a multiprocessor system issues a combined token and operation request in a single bus transaction on a bus coupled to the bus master. The combined token and operation request solicits a single existing token required to complete the global operations within the multiprocessor system and identifies the first of the global operations to be processed with the token, if granted. Once a bus master is granted the token, no other bus master will be granted the token until the current token owner explicitly requests release. The current token owner repeats the combined token and operation request for each global operation which needs to be initiated and, on the last global operation, issues a combined request with an explicit release. Acknowledgement of the combined request with release implies release of the token for use by other bus masters.
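To make the handshake concrete, the following minimal C sketch models the combined token-and-operation protocol described above; the CombinedRequest fields, the trivial arbiter, and the retry loop are illustrative assumptions, not the patented bus logic.

```c
#include <stdbool.h>
#include <stdio.h>

/* One bus transaction carrying both the token solicitation and the
 * identity of the global operation to run with the token. */
typedef struct {
    int  master_id;     /* bus master issuing the request             */
    int  operation_id;  /* global operation to process, if granted    */
    bool release_token; /* set on the last operation: release implied */
} CombinedRequest;

static int token_owner = -1;   /* -1 means the single token is free */

/* Grant the token only when it is free or already held by the
 * requester; acknowledging a request with release frees the token. */
static bool arbitrate(const CombinedRequest *req) {
    if (token_owner != -1 && token_owner != req->master_id)
        return false;                       /* another master owns it */
    token_owner = req->master_id;
    printf("master %d runs global op %d\n", req->master_id,
           req->operation_id);
    if (req->release_token)
        token_owner = -1;                   /* explicit release */
    return true;
}

int main(void) {
    CombinedRequest seq[] = {
        {0, 10, false},  /* first combined request: solicits token */
        {0, 11, false},  /* repeated for each global operation     */
        {0, 12, true},   /* last operation: combined with release  */
        {1, 20, true},   /* master 1 can now win the freed token   */
    };
    for (unsigned i = 0; i < sizeof seq / sizeof seq[0]; i++)
        while (!arbitrate(&seq[i])) { /* retry until granted */ }
    return 0;
}
```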
Abstract:
Only a single snooper queue for global operations within a multiprocessor system is implemented within each bus snooper, controlled by a single token allowing completion of one operation. A bus snooper, upon detecting a combined token and operation request, begins speculatively processing the operation if the snooper is not already busy. The snooper then watches for a combined response acknowledging the combined request or a subsequent token request from the same processor, which indicates that the originating processor has been granted the sole token for completing global operations, before completing the operation. When processing an operation from a combined request and detecting an operation request (only) from a different processor, which indicates that another processor has been granted the token, the snooper suspends processing of the current operation and begins processing the new operation. If the snooper is busy when a combined request is received, the snooper retries the operation portion of the combined request and, upon detecting a subsequent operation request (only) for the operation, begins processing the operation at that time if not busy. Snoop logic for large multiprocessor systems is thus simplified, with conflict reduced to situations in which multiple processors are competing for the token.
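A rough C sketch of the snooper's retry/suspend behavior follows; the two-state Snooper, the ReqKind encoding, and the printed trace are assumptions chosen for illustration, not the patent's snoop logic.

```c
#include <stdio.h>

typedef enum { IDLE, SPECULATING } SnoopState;
typedef enum { COMBINED_TOKEN_AND_OP, OP_ONLY } ReqKind;

typedef struct {
    SnoopState state;
    int src;  /* processor whose operation occupies the single queue */
    int op;
} Snooper;

/* Returns 1 when the snooper must retry the operation portion. */
static int snoop(Snooper *s, ReqKind kind, int src, int op) {
    if (kind == COMBINED_TOKEN_AND_OP) {
        if (s->state != IDLE)
            return 1;            /* busy: retry; op will be reissued */
        s->state = SPECULATING;  /* speculatively start processing   */
        s->src = src; s->op = op;
        printf("speculating on op %d from P%d\n", op, src);
        return 0;
    }
    /* An operation-only request means 'src' now holds the token. */
    if (s->state == SPECULATING && s->src != src)
        printf("suspending op %d from P%d\n", s->op, s->src);
    s->state = SPECULATING; s->src = src; s->op = op;
    printf("processing op %d from token owner P%d\n", op, src);
    return 0;
}

int main(void) {
    Snooper s = { IDLE, -1, -1 };
    snoop(&s, COMBINED_TOKEN_AND_OP, 0, 10); /* speculative start */
    snoop(&s, OP_ONLY, 1, 20);               /* P1 won the token  */
    return 0;
}
```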
Abstract:
A system for time-ordered execution of load instructions. More specifically, the system enables just-in-time delivery of data requested by a load instruction. The system consists of a processor, an L1 data cache with a corresponding L1 cache controller, and an instruction processor. The instruction processor manipulates a plurality of architected time dependency fields of a load instruction to create a plurality of dependency fields. The dependency fields hold relative dependency values which are utilized to order the load instruction in a Relative Time-Ordered Queue (RTOQ) of the L1 cache controller. The load instruction is sent from the RTOQ to the L1 data cache at a particular time so that the requested data is loaded from the L1 data cache at the time specified by one of the dependency fields. The dependency fields are prioritized so that the cycle corresponding to the highest-priority available field is utilized.
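The queue behavior can be sketched as follows in C; the field count, cycle window, and schedule_load helper are hypothetical stand-ins for the RTOQ hardware.

```c
#include <stdio.h>

#define NUM_FIELDS 3   /* prioritized dependency fields, high to low */
#define MAX_CYCLE  64

static int rtoq_busy[MAX_CYCLE];  /* nonzero: L1 slot already claimed */

/* Place a load in the queue at the cycle named by the highest-
 * priority dependency field whose cycle is still free. */
static int schedule_load(const int dep_cycle[NUM_FIELDS]) {
    for (int f = 0; f < NUM_FIELDS; f++) {
        int c = dep_cycle[f];
        if (c >= 0 && c < MAX_CYCLE && !rtoq_busy[c]) {
            rtoq_busy[c] = 1;
            return c;       /* data will be delivered at this cycle */
        }
    }
    return -1;              /* none of the requested cycles is free */
}

int main(void) {
    int deps[NUM_FIELDS] = {12, 14, 20};  /* relative dependency values */
    printf("load issues to L1 at cycle %d\n", schedule_load(deps));
    printf("conflicting load falls back to cycle %d\n", schedule_load(deps));
    return 0;
}
```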
Abstract:
Logically in-line caches within a multilevel cache hierarchy are jointly controlled by a single cache controller. By combining the cache controller and snoop logic for different levels within the cache hierarchy, separate queues are not required for each level. During a cache access, the cache directories are looked up in parallel. Data is retrieved from the upper cache on a hit, or from the lower cache if the upper cache misses and the lower cache hits. LRU units may be updated in parallel based on cache directory hits. Alternatively, the lower cache LRU unit may be updated based on cache memory accesses rather than cache directory hits, or the cache hierarchy may be provided with user-selectable modes of operation supporting both LRU unit update schemes. The merged vertical cache controller mechanism does not require the lower cache memory to be inclusive of the upper cache memory. A novel deallocation scheme and update protocol may be implemented in conjunction with the merged vertical cache controller mechanism.
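A small C sketch of the merged-controller lookup path follows; the probe functions and hit pattern are fabricated for illustration and stand in for real tag comparison.

```c
#include <stdbool.h>
#include <stdio.h>

typedef struct { bool hit; int line; } DirResult;

/* Stand-in directory probes for the two in-line levels; real
 * directories would compare address tags. */
static DirResult probe_upper(unsigned addr) {
    return (DirResult){ (addr & 1) == 0, 1 };
}
static DirResult probe_lower(unsigned addr) {
    return (DirResult){ (addr & 1) != 0, 7 };
}

static void access_cache(unsigned addr) {
    /* Single controller: both directories are looked up in parallel
     * (modeled as two probes before either result is consumed). */
    DirResult u = probe_upper(addr);
    DirResult l = probe_lower(addr);

    if (u.hit)
        printf("addr %u: data from upper cache, line %d\n", addr, u.line);
    else if (l.hit)
        printf("addr %u: upper miss, data from lower, line %d\n", addr, l.line);
    else
        printf("addr %u: miss at both levels\n", addr);

    /* First LRU policy from the abstract: each level's LRU unit is
     * updated in parallel on its own directory hit. */
}

int main(void) { access_cache(2); access_cache(3); return 0; }
```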
Abstract:
Combined response logic for a bus receives a combined data access and cast out/deallocate operation initiated by a storage device within a specific level of a storage hierarchy, with a coherency state and LRU position of the cast out/deallocate victim appended. Snoopers on the bus drive snoop responses to the combined operation with the coherency state and/or LRU position of locally-stored cache lines corresponding to the victim appended. The combined response logic determines, from the coherency state and LRU position information appended to the combined operation and the snoop responses, whether an update of the LRU position and/or coherency state of a cache line corresponding to the victim within one of the snoopers is required. If so, the combined response logic selects a snooper storage device to have at least the LRU position of a respective cache line corresponding to the victim updated, and appends an update command identifying the selected snooper to the combined response. The snooper selected to be updated may be randomly chosen, selected based on LRU position of the cache line corresponding to the victim within respective storage, or selected based on other criteria.
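The selection step might look like the following C sketch, which uses the LRU-position criterion; the SnoopResponse layout and select_snooper helper are assumptions, and the abstract also permits random or other criteria.

```c
#include <stdio.h>

typedef struct {
    int snooper_id;
    int has_line;       /* snooper holds a line matching the victim */
    int lru_position;   /* 0 = most recently used in that snooper   */
} SnoopResponse;

/* Combined-response logic: choose which snooper should update the
 * LRU position (and possibly coherency state) of its copy of the
 * victim. Here the copy nearest the MRU position wins. */
static int select_snooper(const SnoopResponse *r, int n) {
    int best = -1;
    for (int i = 0; i < n; i++)
        if (r[i].has_line &&
            (best < 0 || r[i].lru_position < r[best].lru_position))
            best = i;
    return best < 0 ? -1 : r[best].snooper_id;
}

int main(void) {
    SnoopResponse resp[] = { {0, 1, 5}, {1, 1, 2}, {2, 0, 0} };
    printf("append update command for snooper %d\n",
           select_snooper(resp, 3));
    return 0;
}
```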
Abstract:
In cancelling the cast out portion of a combined operation including a data access related to the cast out, the combined response logic explicitly directs the storage device initiating the combined operation not to allocate storage for the target of the data access. Instead, the target of the data access may be passed directly to an in-line processor core without storage, may be stored in a horizontal storage device, or may be stored in an in-line, noninclusive, lower level storage device. Cancellation of the cast out thus defers any latency associated with writing the cast out victim to system memory while maximizing utilization of available storage with acceptable tradeoffs in data access latency.
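One way to picture the routing decision is the C sketch below; the Routing enum and the room-available flags are illustrative simplifications of the combined-response handshake.

```c
#include <stdio.h>

typedef enum {
    ALLOCATE_LOCALLY,   /* cast out not cancelled: allocate as usual */
    STORE_HORIZONTAL,   /* place target in a horizontal (peer) cache */
    STORE_LOWER_LEVEL,  /* place in an in-line, noninclusive lower   */
    FORWARD_TO_CORE     /* pass directly to the core, no storage     */
} Routing;

/* When the combined response cancels the cast out, the initiator is
 * explicitly told not to allocate; the data-access target is routed
 * to one of the alternatives named in the abstract. */
static Routing route_target(int castout_cancelled,
                            int peer_has_room, int lower_has_room) {
    if (!castout_cancelled) return ALLOCATE_LOCALLY;
    if (peer_has_room)      return STORE_HORIZONTAL;
    if (lower_has_room)     return STORE_LOWER_LEVEL;
    return FORWARD_TO_CORE;
}

int main(void) {
    printf("routing = %d\n", route_target(1, 0, 1)); /* lower level */
    return 0;
}
```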
Abstract:
A cache and method of maintaining cache coherency in a data processing system are described. The data processing system includes a plurality of processors and a plurality of caches coupled to an interconnect. According to the method, a first data item is stored in a first of the caches in association with an address tag indicating an address of the first data item. A coherency indicator in the first cache is set to a first state that indicates that the tag is valid and that the first data item is invalid. Thereafter, the interconnect is snooped to detect a data transfer initiated by another of the plurality of caches, where the data transfer is associated with the address indicated by the address tag and contains a valid second data item. In response to detection of such a data transfer while the coherency indicator is set to the first state, the first data item is replaced by storing the second data item in the first cache in association with the address tag. In addition, the coherency indicator is updated to a second state indicating that the second data item is valid and that the first cache can supply said second data item in response to a request.
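The two coherency states can be sketched as a small C state machine; the state names and CacheLine layout are assumptions, not the patent's encoding.

```c
#include <stdio.h>

typedef enum {
    TAG_VALID_DATA_INVALID,  /* first state: tag valid, data stale   */
    VALID_CAN_SOURCE         /* second state: data valid, can supply */
} CohState;

typedef struct {
    unsigned tag;      /* address tag retained across the first state */
    CohState state;
    unsigned data;
} CacheLine;

/* Snoop a data transfer seen on the interconnect: if it matches the
 * retained tag while the data is marked invalid, capture the valid
 * second data item and move to the sourcing state. */
static void snoop_transfer(CacheLine *l, unsigned addr, unsigned data) {
    if (l->state == TAG_VALID_DATA_INVALID && l->tag == addr) {
        l->data  = data;
        l->state = VALID_CAN_SOURCE;
        printf("line 0x%X refreshed from snooped transfer\n", addr);
    }
}

int main(void) {
    CacheLine line = { 0x1A2B, TAG_VALID_DATA_INVALID, 0 };
    snoop_transfer(&line, 0x1A2B, 42);  /* matching transfer observed */
    return 0;
}
```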
Abstract:
A register associated with the architected logic queue of a memory-coherent device within a multiprocessor system contains a flag that is set whenever an architected operation (one which might affect the storage hierarchy as perceived by other devices within the system) is posted in the snoop queue of a remote snooping device. The flag remains set and is reset only when a synchronization instruction (such as the "sync" instruction supported by the PowerPC™ family of devices) is received from a local processor. The state of the flag thus provides historical information regarding architected operations which may be pending in other devices within the system after being snooped from the system bus. This historical information is utilized to determine whether a synchronization operation should be presented on the system bus, allowing unnecessary synchronization operations to be filtered and additional system bus cycles made available for other purposes. When a local processor issues a synchronization instruction to the device managing the architected logic queue, the instruction is accepted when the architected logic queue is empty; otherwise the synchronization instruction is retried back to the local processor until the architected logic queue becomes empty. If the flag is set when the synchronization instruction is accepted from the local processor, the synchronization operation is presented on the system bus. If the flag is not set when the synchronization instruction is received from the local processor, the synchronization operation is unnecessary and is not presented on the system bus.
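A compact C sketch of the filtering decision follows; the SyncFilter struct and helper names are invented for illustration, and the real device tracks its queue and flag in hardware.

```c
#include <stdbool.h>
#include <stdio.h>

typedef struct {
    bool flag;         /* set: an architected op was snooped remotely */
    int  queue_depth;  /* entries in the architected logic queue      */
} SyncFilter;

/* A remote snooper posted one of this device's architected ops. */
static void on_remote_post(SyncFilter *f) { f->flag = true; }

/* Local processor issues a sync. Returns true if the sync must go
 * out on the system bus; sets *retry while the queue is nonempty. */
static bool accept_sync(SyncFilter *f, bool *retry) {
    if (f->queue_depth > 0) { *retry = true; return false; }
    *retry = false;
    bool present = f->flag;  /* history: remote ops may be pending */
    f->flag = false;         /* flag is reset only by a sync       */
    return present;
}

int main(void) {
    SyncFilter f = { false, 0 };
    bool retry;
    on_remote_post(&f);
    printf("first sync presented on bus: %d\n", accept_sync(&f, &retry));
    printf("second sync filtered:        %d\n", !accept_sync(&f, &retry));
    return 0;
}
```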
Abstract:
A method and system for increasing system memory bandwidth within a symmetric multiprocessor data-processing system are disclosed. The symmetric multiprocessor data-processing system includes several processing units. With conventional systems, all these processing units are typically coupled to a system memory via an interconnect. In order to increase the bandwidth of the system memory, the system memory is first divided into multiple partial system memories, wherein an aggregate of contents within all of these partial system memories equals the contents of the system memory. Then, each of the processing units is individually associated with one of the partial system memories, such that the bandwidth of the system memory within the symmetric multiprocessor data-processing system is increased.
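As an illustration, the C sketch below maps addresses onto per-unit partial memories; the contiguous division, sizes, and owning_unit helper are arbitrary assumptions.

```c
#include <stdio.h>

#define NUM_UNITS 4
#define MEM_SIZE  (1u << 20)   /* total system memory in bytes */

/* Each processing unit is associated with one partial system memory;
 * together the partials cover the full address space. A contiguous
 * division is shown; interleaving would serve equally well. */
static int owning_unit(unsigned addr) {
    return (int)(addr / (MEM_SIZE / NUM_UNITS));
}

int main(void) {
    unsigned probes[] = { 0x00000, 0x40000, 0x80000, 0xC0000 };
    for (int i = 0; i < 4; i++)
        printf("addr 0x%05X -> partial memory of unit %d\n",
               probes[i], owning_unit(probes[i]));
    return 0;
}
```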
Abstract:
A method of storing values in a cache used by a processor of a computer system, the cache having two or more cache directories. An address tag associated with a memory block is written into a first cache directory during an initial processor cycle, and the address tag is written into a second cache directory during the next or a subsequent processor cycle. Another address tag associated with a different memory block may be read from the second cache directory during the initial processor cycle. Additionally, another address tag associated with yet a different memory block may be read from the first cache directory during the subsequent processor cycle. A write operation for the address tag may be placed into a write queue of the first cache directory prior to writing the address tag into the first cache directory, and the same write operation may be placed into a write queue of the second cache directory prior to writing the address tag into the second cache directory; the write queue of the second cache directory executes its contents independently of the write queue of the first cache directory. This staggered writing ability imparts greater flexibility in carrying out write operations for a cache having multiple directories, thereby increasing performance.
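The staggered write can be sketched in C with two independent write queues; the ring-buffer queues and cycle labels are illustrative assumptions rather than the claimed hardware.

```c
#include <stdio.h>

#define QCAP 8

typedef struct { unsigned tags[QCAP]; int head, tail; } WriteQueue;

static void enqueue(WriteQueue *q, unsigned tag) {
    q->tags[q->tail++ % QCAP] = tag;
}
static int drain_one(WriteQueue *q, unsigned *tag) {
    if (q->head == q->tail) return 0;   /* queue empty */
    *tag = q->tags[q->head++ % QCAP];
    return 1;
}

int main(void) {
    WriteQueue dir0 = {{0}, 0, 0}, dir1 = {{0}, 0, 0};
    unsigned tag = 0xBEEF, t;

    /* The same tag write is queued for both directories... */
    enqueue(&dir0, tag);
    enqueue(&dir1, tag);

    /* ...but each queue drains independently, so the writes land in
     * different cycles: dir1 stays readable while dir0 is written. */
    if (drain_one(&dir0, &t)) printf("cycle N:   dir0 writes 0x%X\n", t);
    if (drain_one(&dir1, &t)) printf("cycle N+1: dir1 writes 0x%X\n", t);
    return 0;
}
```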