摘要:
The invention set forth herein describes a mechanism for efficiently performing extended precision operations on multi-word source operands. Corresponding data words of the source operands are processed together via each instruction of a cascading sequence of instructions. State information generated when each instruction is processed is stored in condition code flags. The state information is optionally used in the processing of subsequent instructions in the sequence and/or accumulated with previously set state information.
摘要:
One embodiment of the present invention sets forth a technique for a program to access multi-dimensional formatted graphics surface memory. Multi-dimensional memory objects called “surfaces” stored in a user-specified data or pixel format and arranged in a graphics optimized layout are accessed by programs using surface instructions. A set of memory access instructions e.g., load, store, reduce, and atomic, referred to as surface instructions, may be used to access the surfaces. Coordinate bounds checking is performed with configurable clamping. Caching behavior may also be specified by the surface instructions. Data format conversion and packing to a specified storage format is supported for store, reduction, and atomic surface instructions. Data format conversion and unpacking from a specified storage format is supported for loads and atomic surface instructions.
摘要:
A “cooperative thread array,” or “CTA,” is a group of multiple threads that concurrently execute the same program on an input data set to produce an output data set. Each thread in a CTA has a unique thread identifier assigned at thread launch time that controls various aspects of the thread's processing behavior such as the portion of the input data set to be processed by each thread, the portion of an output data set to be produced by each thread, and/or sharing of intermediate results among threads. Different threads of the CTA are advantageously synchronized at appropriate points during CTA execution using a barrier synchronization technique in which barrier instructions in the CTA program are detected and used to suspend execution of some threads until a specified number of other threads also reaches the barrier point.
摘要:
A graphics processing unit can queue a large number of texture requests to balance out the variability of texture requests without the need for a large texture request buffer. A dedicated texture request buffer queues the relatively small texture commands and parameters. Additionally, for each queued texture command, an associated set of texture arguments, which are typically much larger than the texture command, are stored in a general purpose register. The texture unit retrieves texture commands from the texture request buffer and then fetches the associated texture arguments from the appropriate general purpose register. The texture arguments may be stored in the general purpose register designated as the destination of the final texture value computed by the texture unit. Because the destination register must be allocated for the final texture value as texture commands are queued, storing the texture arguments in this register does not consume any additional registers.
摘要:
A memory management system provides the ability for multiple requesters to access blocks of memory in a pipelined manner. During a first clock, requests for one or more of the memory blocks are received by the system. A determination is made of whether one of the memory blocks is requested by one or more requests. If the same memory block is requested by two or more requests, the system performs a further determination of which of the requests will be provided to the memory block. The determined request is provided to the memory block on the first clock. During a second clock, the data of the determined request is latched to the memory block and a memory access is initiated. If the request is a write request, the data is written to the memory block. If the request is a read request, then the requested data is retrieved and, on a third clock, the data is driven onto a bus, routed to the determined requester, and available to be latched into the requester on the fourth clock.
摘要:
Circuits, methods, and apparatus for using redundant circuitry on integrated circuits in order to increase manufacturing yields. One exemplary embodiment of the present invention provides a circuit configuration wherein functional circuit blocks in a group of circuit blocks are selected by multiplexers. Multiplexers at the input and output of the group of circuit blocks steer input and output signals to and from functional circuit blocks, avoiding circuit blocks found to be defective or nonfunctional. Multiple groups of these circuit blocks may be arranged in series and in parallel. Alternate multiplexer configurations may be used in order to provide a higher level of redundancy. Other embodiments use all functional circuit blocks and sort integrated circuits based on the level of functionality or performance. Other embodiments provide methods of testing integrated circuits having one or more of these circuit configurations.
摘要:
A massively parallel computer system is disclosed having a global router network in which pipeline registers are spatially distributed to increase the messaging speed of the global router network. The global router network includes an expansion tap for processor to I/O messaging so that I/O messaging bandwidth matches interprocessor messaging bandwidth. A route-opening message packet includes protocol bits which are treated homogeneously with steering bits. The route-opening packet further includes redundant address bits for imparting a multiple-crossbars personality to router chips within the global router network. A structure and method for spatially supporting the processors of the massively parallel system and the global router network are also disclosed.
摘要:
To effect a block data transfer between a plurality of physical I/O devices coupled through interfaces to an I/O channel ("IOC") bus, a source logical device is established by programmably assigning to each of the physical device interfaces a logical device identifier, a leaf identifier determining when the physical device participates relative to the first data transfer in the block data transfer, a burst count specifying the number of consecutive transfers for which the physical device is responsible when its interleave period arrives, and an interleave factor identifying how often the physical device participates in the block data transfer. A destination logical device is similarly established. The source and logical devices are then activated to accomplish a block transfer of data between them. To permit different I/O processors to operate independently in making I/O requests, requests from each I/O processor are communicated to an IOC controller over another bus, which need not be a high performance bus, and are serviced to construct header packets in a transaction buffer identifying IOC transactions, including source and destination logical devices. When each packet is finished, the responsible I/O processor puts a pointer into a transaction queue, which is a FIFO register. Each IOC transaction is initiated as its corresponding pointer is popped from the transaction queue. Apparatus embodiments are disclosed as well.
摘要:
One embodiment of the present invention sets forth a technique for efficiently performing voting operations within a multi-threaded parallel-processing system. A group of related parallel program threads executes within a processor core together in parallel. A new instruction, called a “vote” instruction, is introduced that enables a parallel program thread to post an individual vote within the context of the group of related threads and to receive the result of the vote. In this fashion, the vote instruction advantageously reduces overhead associated with inter-thread communication, thereby improving overall system performance.
摘要:
One embodiment of the present invention sets forth a technique for a program to access multi-dimensional formatted graphics surface memory. Multi-dimensional memory objects called “surfaces” stored in a user-specified data or pixel format and arranged in a graphics optimized layout are accessed by programs using surface instructions. A set of memory access instructions e.g., load, store, reduce, and atomic, referred to as surface instructions, may be used to access the surfaces. Coordinate bounds checking is performed with configurable clamping. Caching behavior may also be specified by the surface instructions. Data format conversion and packing to a specified storage format is supported for store, reduction, and atomic surface instructions. Data format conversion and unpacking from a specified storage format is supported for loads and atomic surface instructions.