摘要:
A system for handling a plurality of single precision floating point instructions and a plurality of double precision floating point instructions that both index a same set of registers is provided. The system comprises a decode unit arranged to decode, stall, and forward at least one of the plurality of single precision and at least one of the plurality of double precision floating point instructions in a fetch group. The decode unit includes a first counter arranged to increment for each of the plurality of single precision floating point instructions forwarded down a pipeline; a second counter arranged to increment for each of the plurality of double precision floating point instructions forwarded down the pipeline; a first mask register and a second mask register. The first mask register is updated by each of the single precision floating point instructions forwarded and the second mask register is updated by each of the double precision floating point instructions forwarded.
摘要:
A technique for flattening architectural register windows into flattened space depending on a current window pointer to a register window is provided. The technique involves converting an n-bit value of a particular register in a register window to an x-bit value dependent on the current window pointer, where x is greater than n, and where the x-bit value is used for register dependency checking among a plurality of instructions.
摘要:
A method and apparatus to determine readiness of a complex instruction for retirement includes decoding a complex instruction into a plurality of helper instructions; executing the plurality of helper instructions using an execution unit; indicating the plurality of helper instructions that are alive using a live instruction register; and maintaining a complex instruction identification for the complex instruction using a complex instruction identification register.
摘要:
A technique for handling a condition code modifying instruction in an out-of-order multi-stranded processor involves providing a condition code architectural register file for each strand, providing a condition code working register file, and assigning condition code architectural register file identification information (CARF_ID) and condition code working register file identification information (CWRF_ID) to the condition code modifying instruction. CARF_ID is used to index a location in a condition code rename table to which the CWRF_ID is stored. Thereafter, upon an exception-free execution of the condition code modifying instruction, a result of the execution is copied from the condition code working register file to the condition code architectural register file dependent on CARF_ID, CWRF_ID, register type information, and strand identification information.
摘要:
An efficient branch prediction structure is described that bifurcates a branch prediction structure into at least two portions where information stored in the second portion is aliased amongst multiple entries of the first portion. In this way, overall storage (and layout area) can be reduced and scaling with a branch prediction structure that includes a (2N)K×1 branch direction entries and a (N/2)K×1 branch prediction qualifier entries is less dramatic than conventional techniques. An efficient branch prediction structure includes entries for branch direction indications and entries for branch prediction qualifier indications. The branch direction indication entries are more numerous than the branch prediction qualifier entries. An entry from the branch direction entries is selected based at least in part on a corresponding instruction instance identifier and an entry from the branch prediction qualifier entries is selected based at least in part on least significant bits of the instruction instance identifier.
摘要:
Offloading data coherence operations from a primary processing unit(s) executing instantiated code responsible for data coherence in a shared-cache cluster to a data coherence offload engine reduces resource consumption and allows for efficient sharing of data in accordance with the data coherence protocol. Some of the data coherence operations, such as consulting and maintaining a directory, generating messages, and writing a data unit can be performed by a data coherence offload engine. The data coherence offload engine indicates availability of the data unit in the memory to the appropriate instantiated code. Hence, the instantiated code (the corresponding primary processing unit) is no longer burdened with some of the work load of data coherence operations. Migration of tasks from a primary processing unit(s) to data coherence offload engines allows for efficient retrieval and writing of a requested data unit.
摘要:
A method for offloading computation flexibly to a communication adapter includes receiving a message that includes a procedure image identifier associated with a procedure image of a host application, determining a procedure image and a communication adapter processor using the procedure image identifier, and forwarding the first message to the communication adapter processor configured to execute the procedure image. The method further includes executing, on the communication adapter processor independent of a host processor, the procedure image in communication adapter memory by acquiring a host memory latch for a memory block in host memory, reading the memory block in the host memory after acquiring the host memory latch, manipulating, by executing the procedure image, the memory block in the communication adapter memory to obtain a modified memory block, committing the modified memory block to the host memory, and releasing the host memory latch.
摘要:
An interface device for a compute node in a computer cluster which performs Message Passing Interface (MPI) header matching using parallel matching units. The interface device comprises a memory that stores posted receive queues and unexpected queues. The posted receive queues store receive requests from a process executing on the compute node. The unexpected queues store headers of send requests (e.g., from other compute nodes) that do not have a matching receive request in the posted receive queues. The interface device also comprises a plurality of hardware pipelined matcher units. The matcher units perform header matching to determine if a header in the send request matches any headers in any of the plurality of posted receive queues. Matcher units perform the header matching in parallel. In other words, the plural matching units are configured to search the memory concurrently to perform header matching.
摘要:
Managing operations in a first compute node of a multi-computer system. A remote write may be received to a first address of a remote compute node. A first data structure entry may be created in a data structure, which may include the first address and status information indicating that the remote write has been received. Upon determining that the local cache of the first compute node has been updated with the remote write, the remote write may be issued to the remote compute node. Accordingly, the first data structure entry may be released upon completion of the remote write.
摘要:
A system, comprising a compute node and coupled network adapter (NA), that supports improved data transfer request buffering and a more efficient method of determining the completion status of data transfer requests. Transfer requests received by the NA are stored in a first buffer then transmitted on a network interface. When significant network delays are detected and the first buffer is full, the NA sets a flag to stop software issuing transfer requests. Compliant software checks this flag before sending requests and does not issue further requests. A second NA buffer stores additional received transfer requests that were perhaps in-transit. When conditions improve the flag is cleared and the first buffer used again. Completion status is efficiently determined by grouping network transfer requests. The NA counts received requests and completed network requests for each group. Software determines if a group of requests is complete by reading a count value.