Abstract:
A method and apparatus for enhancing flexibility of instruction ordering in a multi-thread processing system that performs multiply and accumulate operations is presented. A plurality of accumulation registers is provided for storing the results of an adder, wherein each of the plurality of accumulation registers corresponds to a different thread of the plurality of threads. The contents of each of the plurality of accumulation registers can be selected as an input to the adder such that the present accumulated value can be added to a subsequently calculated product to generate a new accumulated value.
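The per-thread accumulation described above can be illustrated with a minimal sketch. The class name `MultiThreadMac`, its method `mac`, and the integer operands are assumptions made for illustration and are not taken from the abstract.

```python
class MultiThreadMac:
    """Toy multiply-accumulate unit with one accumulation register per thread."""

    def __init__(self, num_threads):
        # One accumulation register per thread (hypothetical layout).
        self.acc = [0] * num_threads

    def mac(self, thread_id, a, b):
        # The multiplier forms the product; the adder then combines it with
        # the accumulated value selected for this particular thread.
        product = a * b
        self.acc[thread_id] = self.acc[thread_id] + product
        return self.acc[thread_id]


mac_unit = MultiThreadMac(num_threads=2)
# Instructions from the two threads are interleaved; neither disturbs the
# running total of the other.
mac_unit.mac(0, 2, 3)
mac_unit.mac(1, 10, 10)
mac_unit.mac(0, 4, 5)
mac_unit.mac(1, 1, 1)
```

Because each thread selects its own register as the adder input, the interleaving order of the two threads does not affect either final total.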
Abstract:
A method and apparatus for reducing latency in pipelined circuits that process dependent operations is presented. In order to reduce latency for dependent operations, a pre-accumulation register is included in an operation pipeline between a first operation unit and a second operation unit. The pre-accumulation register stores a first result produced by the first operation unit during a first operation. When the first operation unit completes a second operation to produce a second result, the first result stored in the pre-accumulation register is presented to the second operation unit along with the second result as input operands.
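As a rough sketch of the pre-accumulation idea, the following assumes the first operation unit is a multiplier and the second is an adder. The abstract does not fix either choice, and the names `PreAccumulatePipeline` and `issue` are hypothetical.

```python
class PreAccumulatePipeline:
    """Toy pipeline: a multiplier feeding an adder through a pre-accumulation register."""

    def __init__(self):
        # Pre-accumulation register; None means it currently holds no result.
        self.pre_acc = None

    def issue(self, a, b):
        # First operation unit (assumed here to be a multiplier).
        result = a * b
        if self.pre_acc is None:
            # First operation of a pair: park the result in the register.
            self.pre_acc = result
            return None
        # Second operation of a pair: present both results to the second
        # operation unit (assumed here to be an adder) as its operands.
        first, self.pre_acc = self.pre_acc, None
        return first + result


pipe = PreAccumulatePipeline()
first_out = pipe.issue(2, 3)   # stored in the pre-accumulation register
second_out = pipe.issue(4, 5)  # adder receives 2*3 and 4*5 together
```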
Abstract:
A method and apparatus for avoiding latency in a processing system that includes a memory for storing intermediate results is presented. The processing system stores results produced by an operation unit in memory, where the results may be used by subsequent dependent operations. In order to avoid the latency of the memory, the output of the operation unit may be routed directly back into the operation unit as a subsequent operand. Furthermore, one or more memory bypass registers are included such that the results produced by the operation unit during recent operations that have not yet satisfied the latency requirements of the memory are also available. A first memory bypass register may thus provide the result of an operation that completed one cycle earlier, a second memory bypass register may provide the result of an operation that completed two cycles earlier, etc.
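The bypass registers can be modeled loosely as a short queue of recent results that is consulted before the memory. This sketch assumes one result completes per cycle and a fixed memory latency of `memory_latency` cycles. The class `BypassedResultStore`, its methods, and the string addresses are illustrative assumptions, and the direct output-to-input feedback path is not modeled.

```python
from collections import deque


class BypassedResultStore:
    """Toy model of a result memory with memory bypass registers."""

    def __init__(self, memory_latency=2):
        self.memory = {}
        # bypass[0] holds the most recently completed result, bypass[1] the
        # one completed a cycle before that, and so on.
        self.bypass = deque(maxlen=memory_latency)

    def complete(self, addr, value):
        # Called once per cycle when the operation unit produces a result.
        if len(self.bypass) == self.bypass.maxlen:
            # The oldest in-flight result has now met the memory latency.
            old_addr, old_value = self.bypass.pop()
            self.memory[old_addr] = old_value
        self.bypass.appendleft((addr, value))

    def read(self, addr):
        # Prefer the newest matching bypass register, then fall back to memory.
        for reg_addr, reg_value in self.bypass:
            if reg_addr == addr:
                return reg_value, "bypass"
        return self.memory[addr], "memory"


store = BypassedResultStore(memory_latency=2)
store.complete("r0", 1)
store.complete("r1", 2)
early_read = store.read("r0")   # still in flight, served by a bypass register
store.complete("r2", 3)         # pushes r0 out of the bypass registers
late_read = store.read("r0")    # now served by the memory
```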
Abstract:
A lighting effect computation block and method therefor are presented. The lighting effect computation block separates lighting effect calculations for video graphics primitives into a number of simpler calculations that are performed in parallel but accumulated in an order-dependent manner. Each of the individual calculations is managed by a separate thread controller, where lighting effect calculations for a vertex of a primitive may be performed using a single parent light thread controller and a number of sub-light thread controllers. Each thread controller manages a thread of operation codes related to determination of the lighting parameters for the particular vertex. The thread controllers submit operation codes to an arbitration module based on the expected latency and interdependency between the various operation codes. The arbitration module determines which operation code is executed during a particular cycle, and provides that operation code to a computation engine. The computation engine performs calculations based on the operation code and stores results either in a memory or in an accumulation buffer corresponding to the particular vertex lighting effect block. In order to ensure that the order-dependent operations are properly performed, each of the sub-light thread controllers determines whether or not the accumulation operations for the preceding threads have been initiated before it submits its own final operation code that results in the performance of a subsequent accumulation operation.
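The ordering rule in the last sentence can be sketched as a small simulation. It assumes sub-light threads are numbered in their required accumulation order and that each contributes a single scalar. The function `issue_accumulations` and its arguments are hypothetical.

```python
def issue_accumulations(ready_order, contributions):
    """Return (issue order, accumulated value) for a set of sub-light threads.

    ready_order lists sub-light indices in the order they finish their
    parallel calculations; contributions[i] is the (hypothetical) scalar
    contribution of sub-light i.
    """
    ready = set()
    issued = []
    accumulated = 0.0
    next_light = 0  # lowest-numbered sub-light whose accumulation has not started
    for light in ready_order:
        ready.add(light)
        # A thread submits its final (accumulating) operation code only once
        # every preceding thread has initiated its own accumulation.
        while next_light in ready:
            accumulated += contributions[next_light]
            issued.append(next_light)
            next_light += 1
    return issued, accumulated


# Sub-light 2 finishes first but must wait for sub-lights 0 and 1.
order, total = issue_accumulations([2, 0, 1], [0.5, 0.25, 0.125])
```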
Abstract:
A method and apparatus for eliminating memory contention in a computation module is presented. The method includes, for a current operation being performed by a computation engine of the computation module, processing that begins by identifying one of a plurality of threads for which the current operation is being performed. The plurality of threads constitutes an application (e.g., geometric primitive applications, video graphic applications, drawing applications, etc.). The processing continues by identifying an operation code from a set of operation codes corresponding to the one of the plurality of threads. As such, once the thread has been identified for the current operation, one of its operation codes is identified for that operation. The processing then continues by determining a particular location of a particular one of a plurality of data flow memory devices based on the particular thread and the particular operation code for storing the result of the current operation. The processing then continues by producing a result for the current operation and storing the result at the particular location of the particular one of the data flow memory devices.
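One way to picture the location selection is a fixed mapping from (thread, operation code) to (memory device, slot). The mapping below, with one data flow memory per operation code and one slot per thread, is purely an assumption for illustration. The op code names and the function `result_location` are hypothetical.

```python
# Hypothetical op code set; the abstract does not enumerate the op codes.
OPCODES = ("transform", "normalize", "light")


def result_location(thread_id, opcode):
    """Map a (thread, op code) pair to a (data flow memory, slot) pair."""
    device = OPCODES.index(opcode)  # assumed: one data flow memory per op code
    slot = thread_id                # assumed: one slot per thread in each memory
    return device, slot


# Every (thread, op code) pair gets its own location, so no two results
# ever compete for the same storage.
locations = {
    (t, op): result_location(t, op) for t in range(4) for op in OPCODES
}
```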
Abstract:
A method and apparatus for supporting shared microcode in a multi-thread computation engine is presented. Each of a plurality of thread controllers controls a thread of a plurality of threads that are included in the system. Rather than storing the operation codes associated with their respective threads and providing those operation codes to an arbitration module for execution, each of the thread controllers stores operation code identifiers that are submitted to the arbitration module. Once the arbitration module has determined which operation code should be executed, it passes the operation code identifiers corresponding to that operation code to a microcode generation block. The microcode generation block uses the operation code identifiers to generate a set of input parameters that are provided to a computation engine for execution, where the input parameters correspond to those for the operation code encoded by the operation code identifiers received by the microcode generation block.
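A minimal sketch of the shared-microcode idea follows. It assumes the operation code identifiers consist of a thread number and an op name, and that the shared table stores one template per op name. The table contents, field names, and `generate_input_parameters` are all hypothetical.

```python
# Single shared table of op code templates (hypothetical contents). Thread
# controllers hold only identifiers, never copies of these entries.
SHARED_MICROCODE = {
    "DOT": {"op": "dot", "sources": ("normal", "light_dir"), "dest": "n_dot_l"},
    "SCALE": {"op": "mul", "sources": ("n_dot_l", "diffuse"), "dest": "color"},
}


def generate_input_parameters(thread_id, op_id):
    """Expand operation code identifiers into engine input parameters."""
    template = SHARED_MICROCODE[op_id]
    # Assumed: operand names are qualified by the thread that owns them.
    prefix = f"thread{thread_id}."
    return {
        "op": template["op"],
        "sources": tuple(prefix + name for name in template["sources"]),
        "dest": prefix + template["dest"],
    }


params = generate_input_parameters(thread_id=1, op_id="DOT")
```

Under these assumptions, the same template serves every thread, and only the compact identifiers differ between thread controllers.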
Abstract:
A configurable vertex blending circuit that allows both morphing and skinning operations to be supported in dedicated hardware is presented. Such a configurable vertex blending circuit includes a matrix array that is used for storing the matrices associated with the various portions of the vertex blending operations. Vertex data that is received is stored in an input vertex buffer that includes multiple position buffers such that the multiple positions associated with morphing operations can be stored. Similarly, the single position typically associated with skinning operations can be stored in one of the position buffers. The input vertex buffer further stores blending weights associated with the various component operations that are included in the overall vertex blending operation. An arithmetic unit, which is configured and controlled by a transform controller, performs the calculations required for each of a plurality of component operations included in the overall vertex blending operation. The results of each of these component operations are then combined to produce a blended vertex.
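The combination of component operations can be sketched as a weighted sum of transformed positions, which is the usual formulation of skinning and morphing. The abstract does not spell out the arithmetic, so this formulation is an assumption. The helper names `mat_vec` and `blend_vertex` are hypothetical, and 3x3 matrices are used only to keep the example short.

```python
def mat_vec(matrix, vector):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(row[k] * vector[k] for k in range(len(vector))) for row in matrix]


def blend_vertex(matrices, positions, weights):
    """Weighted sum of transformed positions, one component operation per term."""
    blended = [0.0] * len(positions[0])
    for matrix, position, weight in zip(matrices, positions, weights):
        transformed = mat_vec(matrix, position)
        blended = [b + weight * t for b, t in zip(blended, transformed)]
    return blended


IDENTITY = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
DOUBLE = [[2, 0, 0], [0, 2, 0], [0, 0, 2]]

# Skinning: a single position transformed by several matrices.
p = [1.0, 2.0, 3.0]
skinned = blend_vertex([IDENTITY, DOUBLE], [p, p], [0.5, 0.5])

# Morphing: several positions, each with its own weight.
morphed = blend_vertex(
    [IDENTITY, IDENTITY], [[0.0, 0.0, 0.0], [2.0, 2.0, 2.0]], [0.25, 0.75]
)
```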
Abstract:
A method and apparatus for arbitrating access to a computation engine includes processing that begins by determining, for a given clock cycle of the computation engine, whether at least one operation code is pending. When at least one operation code is pending, the processing continues by providing the operation code to the computation engine. When multiple operation codes are pending for the given clock cycle, the processing determines a priority operation code from the multiple pending operation codes based on an application-specific prioritization scheme. The application-specific prioritization scheme is dependent on the application and may include a two-level prioritization scheme. At the first level, the prioritization scheme prioritizes certain threads over other threads such that the throughput through the computation module is maximized. At the second level, the prioritization scheme prioritizes operation codes within a set of threads of equal priority based on the length of time the data for the operation codes has been in the processing pipeline. The processing then continues by shifting the remaining operation codes of the multiple operation codes to a subsequent clock cycle of the computation engine.
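The two-level scheme can be sketched as a comparison on (thread priority, age). The dictionary fields `priority` and `age`, the function `arbitrate`, and the choice to model "time in the pipeline" as a cycle count are assumptions for illustration.

```python
def arbitrate(pending):
    """Pick one pending op code for this cycle and defer the rest.

    Each pending entry is a dict with a thread "priority" (higher wins) and
    an "age" in cycles (older wins among threads of equal priority).
    """
    if not pending:
        return None, []
    # Level 1: thread priority. Level 2: time spent in the pipeline.
    winner = max(pending, key=lambda op: (op["priority"], op["age"]))
    # Remaining op codes shift to the next clock cycle, one cycle older.
    deferred = [dict(op, age=op["age"] + 1) for op in pending if op is not winner]
    return winner, deferred


pending = [
    {"name": "A", "priority": 1, "age": 5},
    {"name": "B", "priority": 2, "age": 1},
    {"name": "C", "priority": 2, "age": 3},
]
winner, deferred = arbitrate(pending)          # C: top priority, oldest of B and C
next_winner, still_deferred = arbitrate(deferred)
```

Here op code A is the oldest but loses twice, since thread priority is compared before age.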
Abstract:
A computation module and/or geometric engine for use in a video graphics processing circuit includes memory, a computation engine, a plurality of thread controllers, and an arbitration module. The computation engine is operably coupled to perform an operation based on an operation code and to provide a corresponding result to the memory as indicated by the operation code. Each of the plurality of thread controllers manages at least one corresponding thread of a plurality of threads. The plurality of threads constitutes an application. The arbitration module is coupled to the plurality of thread controllers and utilizes an application-specific prioritization scheme to provide operation codes from the plurality of thread controllers to the computation engine such that idle time of the computation engine is minimized. The prioritization scheme prioritizes certain threads over other threads such that the throughput through the computation module is maximized.