摘要:
A technique for performing data prefetching using indirect addressing includes determining a first memory address of a pointer associated with a data prefetch instruction. Content, that is included in a first data block (e.g., a first cache line) of a memory, at the first memory address is then fetched. An offset is then added to the content of the memory at the first memory address to provide a first offset memory address. A second memory address is then determined based on the first offset memory address. A second data block (e.g., a second cache line) that includes data at the second memory address is then fetched (e.g., from the memory or another memory). A data prefetch instruction may be indicated by a unique operational code (opcode), a unique extended opcode, or a field (including one or more bits) in an instruction.
摘要:
A memory system, data processing system, and method are provided for using cache that is embedded in a memory hub device to replace failed memory cells. A memory module comprises an integrated memory hub device. The memory hub device comprises an integrated memory device data interface that communicates with a set of memory devices coupled to the memory hub device and a cache integrated in the memory hub device. The memory hub device also comprises an integrated memory hub controller that controls the data that is read or written by the memory device data interface to the cache based on a determination whether one or more memory cells within the set of memory devices has failed.
摘要:
A method for providing a cluster-wide system clock in a multi-tiered full graph (MTFG) interconnect architecture are provided. Heartbeat signals transmitted by each of the processor chips in the computing cluster are synchronized. Internal system clock signals are generated in each of the processor chips based on the synchronized heartbeat signals. As a result, the internal system clock signals of each of the processor chips are synchronized since the heartbeat signals, that are the basis for the internal system clock signals, are synchronized. Mechanisms are provided for performing such synchronization using direct couplings of processor chips within the same processor book, different processor books in the same supernode, and different processor books in different supernodes of the MTFG interconnect architecture.
摘要:
A system and method for providing hardware based dynamic load balancing of message passing interface (MPI) tasks are provided. Mechanisms for adjusting the balance of processing workloads of the processors executing tasks of an MPI job are provided so as to minimize wait periods for waiting for all of the processors to call a synchronization operation. Each processor has an associated hardware implemented MPI load balancing controller. The MPI load balancing controller maintains a history that provides a profile of the tasks with regard to their calls to synchronization operations. From this information, it can be determined which processors should have their processing loads lightened and which processors are able to handle additional processing loads without significantly negatively affecting the overall operation of the parallel execution system. As a result, operations may be performed to shift workloads from the slowest processor to one or more of the faster processors.
摘要:
A method is provided for implementing a multi-tiered full-graph interconnect architecture. In order to implement a multi-tiered full-graph interconnect architecture, a plurality of processors are coupled to one another to create a plurality of processor books. The plurality of processor books are coupled together to create a plurality of supernodes. Then, the plurality of supernodes are coupled together to create the multi-tiered full-graph interconnect architecture. Data is then transmitted from one processor to another within the multi-tiered full-graph interconnect architecture based on an addressing scheme that specifies at least a supernode and a processor book associated with a target processor to which the data is to be transmitted.
摘要:
A system is provided for implementing a multi-tiered full-graph interconnect architecture. In order to implement a multi-tiered full-graph interconnect architecture, a plurality of processors are coupled to one another to create a plurality of processor books. The plurality of processor books are coupled together to create a plurality of supernodes. Then, the plurality of supernodes are coupled together to create the multi-tiered full-graph interconnect architecture. Data is then transmitted from one processor to another within the multi-tiered full-graph interconnect architecture based on an addressing scheme that specifies at least a supernode and a processor book associated with a target processor to which the data is to be transmitted.
摘要:
A method, computer program product, and system are provided for selecting, from a plurality of routes through the data processing system, an indirect route for transmitting data. Data that includes address information is received at a first processor that is to be transmitted to a destination processor. Using routing table data structures, indirect route entries are identified that correspond to indirect routes for transmitting data. An accessed priority table data structure comprises a priority entry for each entry in the routing table data structures. The priority entry specifies a priority of a corresponding entry in the routing table data structures. An indirect route entry is selected that corresponds to an indirect route from the routing table data structures, based on specified priorities. Then the data is transmitted from the first processor to the destination processor using a path corresponding to the selected indirect route entry.
摘要:
A method, computer program product, and system are provided for selecting, from a plurality of routes through the data processing system, a direct route for transmitting data. Data that includes address information is received at a first processor that is to be transmitted to a destination processor. Using routing table data structures, direct route entries are identified that correspond to direct routes for transmitting data. An accessed priority table data structure comprises a priority entry for each entry in the routing table data structures. The priority entry specifies a priority of a corresponding entry in the routing table data structures. A direct route entry is selected that corresponds to a direct route from the routing table data structures, based on specified priorities. Then the data is transmitted from the first processor to the destination processor using a path corresponding to the selected direct route entry.
摘要:
In addition to an address tag, a coherency state and an LRU position, each cache directory entry includes historical processor access information for the corresponding cache line. The historical processor access information includes different subentries for each different processor which has accessed the corresponding cache line, with subentries being “pushed” along the stack when a new processor accesses the subject cache line. Each subentries contains the processor identifier for the corresponding processor which accessed the cache line, one or more opcodes identifying the operations which were performed by the processor, and timestamps associated with each opcode. This historical processor access information may then be utilized by the cache controller to influence victim selection, coherency state transitions, LRU state transitions, deallocation timing, and other cache management functions so that smaller caches are given the effectiveness of very large caches through more intelligent cache management.
摘要:
An input/output channel controller includes a storage array for temporarily storing data and multiple clocks to access or update the data. One or more array clock signals are generated from a system clock combined with other clock signals to generate a single clock signal which is positioned in time by a clock positioning circuit to accommodate circuit throughput delay variations and to effectively reduce hold time to zero. Storage arrays may be clocked at significantly higher frequencies and arrays may have multiple gated clocks without incurring the hold time problems.