Abstract:
In parallel processing devices, for streaming computations, processing of each data element of the stream may not be computationally intensive and thus processing may take relatively small amounts of time to compute as compared to memory accesses times required to read the stream and write the results. Therefore, memory throughput often limits the performance of the streaming computation. Generally stated, provided are methods for achieving improved, optimized, or ultimately, maximized memory throughput in such memory-throughput-limited streaming computations. Streaming computation performance is maximized by improving the aggregate memory throughput across the plurality of processing elements and threads. High aggregate memory throughput is achieved by balancing processing loads between threads and groups of threads and a hardware memory interface coupled to the parallel processing devices.
Abstract:
In parallel processing devices, for streaming computations, processing of each data element of the stream may not be computationally intensive and thus processing may take relatively small amounts of time to compute as compared to memory accesses times required to read the stream and write the results. Therefore, memory throughput often limits the performance of the streaming computation. Generally stated, provided are methods for achieving improved, optimized, or ultimately, maximized memory throughput in such memory-throughput-limited streaming computations. Streaming computation performance is maximized by improving the aggregate memory throughput across the plurality of processing elements and threads. High aggregate memory throughput is achieved by balancing processing loads between threads and groups of threads and a hardware memory interface coupled to the parallel processing devices.
Abstract:
An apparatus and method for superforwarding load operands in a microprocessor are provided. An execution unit in a microprocessor is configured to receive a load instruction and a subsequent instruction. If the load instruction corresponds to a simple load instruction, a destination operand of the load instruction can be superforwarded to a subsequent instruction if the subsequent instruction specifies a source operand that depends on the destination operand of the load instruction. The subsequent instruction is not required to wait until a load instruction executes or completes and can be scheduled and/or executed prior to or at the same time as the load instruction. Consequently, latencies associated with operand dependencies may be reduced.
Abstract:
A microprocessor configured to rapidly execute floating point store status word (FSTSW) type instructions that are immediately preceded by floating point compare (FCOM) type instructions is disclosed. FCOM-type instructions are modified to store their results to an architectural floating point status word and a temporary destination register. If an FSTSW-type instruction is detected immediately following an FCOM-type instruction, then the FSTSW-type instruction is transformed into a special fast floating point store status word (FSTSWEF) instruction. Unlike the FSTSW-type instruction, which is serializing and negatively impacts performance, the FSTSWEF instruction is not serializing and allows execution to continue without undue serialization. A computer system and method for rapidly executing FSTSW instructions immediately preceded by FCOM-type instructions are also disclosed.
Abstract:
An apparatus and method for executing floating-point store instructions in a microprocessor is provided. If store data of a floating-point store instruction corresponds to a tiny number and an underflow exception is masked, then a trap routine can be executed to generate corrected store data and complete the store operation. In response to detecting that store data corresponds to a tiny number and the underflow exception is masked, the store data, store address information, and opcode information can be stored prior to initiating the trap routine. The trap routine can be configured to access the store data, store address information, and opcode information. The trap routine can be configured to generate corrected store data and complete the store operation using the store data, store address information, and opcode information.
Abstract:
A multi-function look-up table for determining output values for predetermined ranges of a first mathematical function and a second mathematical function. In one embodiment, the multi-function look-up table is a bipartite look-up table including a first plurality of storage locations and a second plurality of storage locations. The first plurality of storage locations store base values for the first and second mathematical functions. Each base value is an output value (for either the first or second function) corresponding to an input region which includes the look-up table input value. The second plurality of storage locations, on the other hand, store difference values for both the first and second mathematical functions. These difference values are used for linear interpolation in conjunction with a corresponding base value in order to generate a look-up table output value. The multi-function look-up table further includes an address control unit coupled to receive a first input value and a signal which indicates whether an output value is to be generated for the first or second mathematical function. The address control unit then generates a first address value from these signals which is in turn conveyed to the first and second plurality of storage locations. In response to receiving the first address value, the first and second plurality of storage locations are configured to output a first base value and a first difference value, respectively. The first base value and first difference value are then conveyed to an output unit configured to generate a look-up table output value from the two values.
Abstract:
A multiplier capable of performing signed and unsigned scalar and vector multiplication is disclosed. The multiplier is configured to receive signed or unsigned multiplier and multiplicand operands in scalar or packed vector form. An effective sign for the multiplier and multiplicand operands may be calculated and used to create and select a number of partial products according to Booth's algorithm. Once the partial products have been created and selected, they may be summed and the results may be output. The results may be signed or unsigned, and may represent vector or scalar quantities. When a vector multiplication is performed, the multiplier may be configured to generate and select partial products so as to effectively isolate the multiplication process for each pair of vector components. The multiplier may also be configured to sum the products of the vector components to form the vector dot product. The final product may be output in segments so as to require fewer bus lines. The segments may be rounded by adding a rounding constant. Rounding and normalization may be performed in two paths, one assuming an overflow will occur, the other assuming no overflow will occur. The multiplier may also be configured to perform iterative calculations to evaluate constant powers of an operand. Intermediate products that are formed may be rounded and normalized in two paths and then compressed and stored for use in the next iteration. An adjustment constant may also be added to increase the frequency of exactly rounded results.