Abstract:
An apparatus and method for handling tiny numbers using a super sticky bit are provided. In response to detecting that a preliminary result of an instruction corresponds to a tiny number and an underflow exception is masked, an execution pipeline can be configured to store a value corresponding to the preliminary result and a super sticky bit in a destination register. Also, a destination register tag corresponding to the destination register and a denormal exception indicator corresponding to the tiny number and masked underflow exception can be stored. A trap handler can be initiated to generate a corrected result for the instruction. The trap handler can detect that the denormal exception indicator has been set and can read the value and the super sticky bit from the destination register using the destination register tag. The trap handler can generate a corrected result for the instruction based on the value and the super sticky bit. An instruction subsequent to the trapping instruction can then be restarted.
Abstract:
A multiplier configured to obtain higher frequencies of exactly rounded results by adding an adjustment constant to intermediate products generated during iterative multiplication operations is disclosed. One such iterative multiplication operation is the Newton-Raphson iteration, which may be utilized by the multiplier to perform reciprocal calculations and reciprocal square root calculations. For each iteration, the results converge toward an infinitely precise result. To improve the frequency of the exactly rounded result, the results of the iterative calculations may be studied for a large number of differing input operands to determine the best suited value for the adjustment constant. The multiplier may also be configured to perform scalar and packed vector multiplication using the same hardware.
Abstract:
An execution unit configured to perform a plurality of arithmetic operations using the same set of operands. These operands include corresponding input vector values in each of a plurality of input registers. The execution unit is coupled to receive these input vector values, as well as an instruction value indicative of one of the plurality of arithmetic operations. In one embodiment, the plurality of arithmetic operations includes a vectored add instruction, a vectored subtract instruction, a vectored reverse subtract instruction, and an accumulate instruction. The vectored instructions perform arithmetic operations concurrently using corresponding values from each of the plurality of input registers. The accumulate instruction, however, is executable to add together all input values within a single input register. The execution unit further includes a multiplexer unit configured to selectively route the input vector values to a plurality of adder units according to the opcode value. In an embodiment in which the execution unit is configured to perform subtraction operations as well as addition, the multiplexer unit is additionally configured to selectively route negated versions (either one's or two's complement format) to the plurality of adder units. Each of the plurality of adder units is configured to generate a sum based upon the values conveyed from the multiplexer unit. The accumulate instruction advantageously allows important operations such as the matrix multiply to be performed rapidly. Because the matrix multiply is an integral part of many applications (particularly graphics applications), the accumulate instruction may lead to increased overall system performance.
Abstract:
A graphics processing unit is programmed to carry out cryptographic processing so that fast, effective cryptographic processing solutions can be provided without incurring additional hardware costs. The graphics processing unit can efficiently carry out cryptographic processing because it has an architecture that is configured to handle a large number of parallel processes. The cryptographic processing carried out on the graphics processing unit can be further improved by configuring the graphics processing unit to be capable of both floating point and integer operations.
Abstract:
The present invention enables efficient matrix multiplication operations on parallel processing devices. One embodiment is a method for mapping CTAs to result matrix tiles for matrix multiplication operations. Another embodiment is a second method for mapping CTAs to result tiles. Yet other embodiments are methods for mapping the individual threads of a CTA to the elements of a tile for result tile computations, source tile copy operations, and source tile copy and transpose operations. The present invention advantageously enables result matrix elements to be computed on a tile-by-tile basis using multiple CTAs executing concurrently on different streaming multiprocessors, enables source tiles to be copied to local memory to reduce the number accesses from the global memory when computing a result tile, and enables coalesced read operations from the global memory as well as write operations to the local memory without bank conflicts.
Abstract:
A multiplier capable of performing signed and unsigned scalar and vector multiplication is disclosed. The multiplier is configured to receive signed or unsigned multiplier and multiplicand operands in scalar or packed vector form. An effective sign for the multiplier and multiplicand operands may be calculated and used to create and select a number of partial products according to Booth's algorithm. Once the partial products have been created and selected, they may be summed and the results may be output. The results may be signed or unsigned, and may represent vector or scalar quantities. When a vector multiplication is performed, the multiplier may be configured to generate and select partial products so as to effectively isolate the multiplication process for each pair of vector components. The multiplier may also be configured to sum the products of the vector components to form the vector dot product. The final product may be output in segments so as to require fewer bus lines. The segments may be rounded by adding a rounding constant. Rounding and normalization may be performed in two paths, one assuming an overflow will occur, the other assuming no overflow will occur. The multiplier may also be configured to perform iterative calculations to evaluate constant powers of an operand. Intermediate products that are formed may be rounded and normalized in two paths and then compressed and stored for use in the next iteration. An adjustment constant may also be added to increase the frequency of exactly rounded results.
Abstract:
The use of checking instructions to detect special and exceptional cases of a defined data format in a microprocessor is disclosed. Generally speaking, a checking instruction is included with the microcode of floating-point instructions to detect special and exceptional cases of operand values for the floating-point instructions. A checking instruction is configured to set one or more flags in a flags register if it detects a special or exceptional case for an operand value. A checking instruction may also set the result or results of a floating-point instruction to a result value if a special or exceptional case is detected. In addition, a checking instruction may be configured to set one or more bits in status register if a special or exceptional case is detected. After a checking instruction completes execution, a subsequent microcode instruction can be executed to determine if one or more flags were set by the checking instruction. If one or more flags have been set by the checking instruction, the subsequent microcode instruction can branch to a non-sequential microcode instruction to handle the special or exceptional case detected by the checking instruction.
Abstract:
The present invention enables efficient matrix multiplication operations on parallel processing devices. One embodiment is a method for mapping CTAs to result matrix tiles for matrix multiplication operations. Another embodiment is a second method for mapping CTAs to result tiles. Yet other embodiments are methods for mapping the individual threads of a CTA to the elements of a tile for result tile computations, source tile copy operations, and source tile copy and transpose operations. The present invention advantageously enables result matrix elements to be computed on a tile-by-tile basis using multiple CTAs executing concurrently on different streaming multiprocessors, enables source tiles to be copied to local memory to reduce the number accesses from the global memory when computing a result tile, and enables coalesced read operations from the global memory as well as write operations to the local memory without bank conflicts.
Abstract:
A multiplier capable of performing signed and unsigned scalar and vector multiplication is disclosed. The multiplier is configured to receive signed or unsigned multiplier and multiplicand operands in scalar or packed vector form. An effective sign for the multiplier and multiplicand operands may be calculated and used to create and select a number of partial products according to Booth's algorithm. Once the partial products have been created and selected, they may be summed and the results may be output. The results may be signed or unsigned, and may represent vector or scalar quantities. When a vector multiplication is performed, the multiplier may be configured to generate and select partial products so as to effectively isolate the multiplication process for each pair of vector components. The multiplier may also be configured to sum the products of the vector components to form the vector dot product. The final product may be output in segments so as to require fewer bus lines. The segments may be rounded by adding a rounding constant. Rounding and normalization may be performed in two paths, one assuming an overflow will occur, the other assuming no overflow will occur. The multiplier may also be configured to perform iterative calculations to evaluate constant powers of an operand. Intermediate products that are formed may be rounded and normalized in two paths and then compressed and stored for use in the next iteration. An adjustment constant may also be added to increase the frequency of exactly rounded results.
Abstract:
One embodiment of the present invention sets forth a technique for performing fast integer division using commonly available arithmetic operations. The technique may be implemented in a two-stage process using a single-precision floating point reciprocal in conjunction with integer addition and multiplication. Furthermore, the technique may be fully pipelined on many conventional processors for performance that is comparable to the best available high-performance alternatives.