Abstract:
An apparatus and a method for accelerated processing of an arithmetic operation. The apparatus comprises an operand pre-arithmetic status register configured to generate a status notification that flags that one of a set of predetermined combinatory conditions between a first operand and a second operand is met; and a modified arithmetic logic unit. The modified arithmetic logic unit comprises an electronic logic circuit configured to, in response to receiving the status notification from the operand pre-arithmetic status register, redirect execution of the arithmetic operation to an expedited routine within the modified arithmetic logic unit if the status notification comprises one or more flags, or to a conventional routine if the status notification is a blank status notification, the expedited routine requiring fewer calculation cycles to output an operation result than the conventional routine.
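The dispatch described above can be illustrated with a small software sketch. Everything named here is an assumption made for illustration: the abstract does not say which operand conditions are flagged or which operation is accelerated, so the sketch picks multiplication of non-negative integers and flags the cases where an operand is 0 or 1. The names `OperandStatus`, `pre_arithmetic_status`, `expedited_multiply` and `conventional_multiply` are hypothetical.

```python
from enum import Flag, auto


class OperandStatus(Flag):
    """Hypothetical flag set; the abstract does not list the actual conditions."""
    NONE = 0            # blank status notification
    A_IS_ZERO = auto()
    B_IS_ZERO = auto()
    A_IS_ONE = auto()
    B_IS_ONE = auto()


def pre_arithmetic_status(a: int, b: int) -> OperandStatus:
    """Stand-in for the operand pre-arithmetic status register."""
    status = OperandStatus.NONE
    if a == 0:
        status |= OperandStatus.A_IS_ZERO
    if b == 0:
        status |= OperandStatus.B_IS_ZERO
    if a == 1:
        status |= OperandStatus.A_IS_ONE
    if b == 1:
        status |= OperandStatus.B_IS_ONE
    return status


def expedited_multiply(a: int, b: int, status: OperandStatus) -> int:
    """Shortcut path: the result follows from the flags without any iteration."""
    if status & (OperandStatus.A_IS_ZERO | OperandStatus.B_IS_ZERO):
        return 0
    if status & OperandStatus.A_IS_ONE:
        return b
    return a  # only B_IS_ONE remains


def conventional_multiply(a: int, b: int) -> int:
    """Shift-and-add multiply, one loop iteration per bit of b."""
    result = 0
    while b:
        if b & 1:
            result += a
        a <<= 1
        b >>= 1
    return result


def multiply(a: int, b: int) -> tuple[int, str]:
    """Route to the expedited or conventional routine based on the status flags."""
    if a < 0 or b < 0:
        raise ValueError("this sketch only handles non-negative operands")
    status = pre_arithmetic_status(a, b)
    if status:  # one or more flags set
        return expedited_multiply(a, b, status), "expedited"
    return conventional_multiply(a, b), "conventional"
```

Calling `multiply(1, 9)` takes the expedited path, while `multiply(6, 7)` falls through to the shift-and-add loop.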
Abstract:
A processor unit for multiply and accumulate ("MAC") operations is provided, the processor unit comprising: a plurality of MAC units for performing a set of MAC operations, wherein each MAC unit of the plurality of MAC units includes an execution unit and a one-write one-read ("1W/1R") register file, wherein the 1W/1R register file has at least one accumulator; and another register file, wherein the execution unit of each MAC unit is configured to perform a subset of MAC operations by computing a product of a set of values received from the another register file and adding the computed product to a content of the at least one accumulator, wherein each MAC unit is configured to perform the subset of MAC operations in a single clock cycle.
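As a rough behavioural model of this arrangement, the sketch below gives each MAC unit its own accumulator and feeds all units from one shared register file. The class and function names (`MacUnit`, `run_cycle`), the use of two operands per product, and the way operand indices are supplied are assumptions for illustration; the abstract does not specify them, and the model says nothing about timing.

```python
from dataclasses import dataclass


@dataclass
class MacUnit:
    # Stands in for the accumulator held in the unit's 1W/1R register file.
    accumulator: int = 0

    def step(self, a: int, b: int) -> int:
        """One multiply-accumulate: accumulator += a * b."""
        self.accumulator += a * b
        return self.accumulator


def run_cycle(units, shared_regs, operand_pairs):
    """Model one cycle: every unit reads two values from the shared
    register file (by index) and accumulates their product."""
    for unit, (i, j) in zip(units, operand_pairs):
        unit.step(shared_regs[i], shared_regs[j])


units = [MacUnit() for _ in range(2)]
shared = [1, 2, 3, 4]
run_cycle(units, shared, [(0, 1), (2, 3)])
run_cycle(units, shared, [(0, 1), (2, 3)])
totals = [u.accumulator for u in units]
```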
Abstract:
An apparatus and method are described for performing a vector reduction. For example, an apparatus according to one embodiment comprises: a reduction logic tree comprised of a set of N-1 reduction logic blocks used to perform reduction in a single operation cycle for N vector elements; a first input vector register storing a first input vector communicatively coupled to the set of reduction logic blocks; a second input vector register storing a second input vector communicatively coupled to the set of reduction logic blocks; a mask register storing a mask value controlling a set of one or more multiplexers, each of the set of multiplexers selecting a value directly from the first input vector register or an output containing a processed value from one of the reduction logic blocks; and an output vector register coupled to outputs of the one or more multiplexers to receive values passed through by each of the multiplexers responsive to the control signals.
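The count of N-1 reduction blocks for N elements matches what a pairwise reduction tree needs, which the following sketch checks in software. It is only a loose model under stated assumptions: the per-lane multiplexer is assumed to choose between the untouched element of the first input vector and the final reduced value, the role of the second input vector is left out because the abstract does not describe it, and the function names are hypothetical.

```python
def tree_reduce(values, op):
    """Pairwise tree reduction of a non-empty sequence.

    Returns (result, blocks_used), where blocks_used counts the
    two-input reduction blocks the tree would need.
    """
    level = list(values)
    blocks = 0
    while len(level) > 1:
        nxt = []
        for i in range(0, len(level) - 1, 2):
            nxt.append(op(level[i], level[i + 1]))
            blocks += 1
        if len(level) % 2:          # odd element passes to the next level
            nxt.append(level[-1])
        level = nxt
    return level[0], blocks


def masked_reduce(first_vector, mask, op):
    """Assumed multiplexer behaviour: a set mask bit selects the reduced
    value for that lane, a clear bit passes the input element through."""
    reduced, blocks = tree_reduce(first_vector, op)
    output = [reduced if m else v for v, m in zip(first_vector, mask)]
    return output, blocks
```

For example, `masked_reduce([1, 2, 3, 4], [1, 0, 0, 1], lambda x, y: x + y)` uses three blocks for four elements and writes the sum only to the first and last lanes.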
Abstract:
A method is described that includes reading a first read mask from a first register. The method also includes reading a first vector operand from a second register or memory location. The method also includes applying the read mask against the first vector operand to produce a set of elements for operation. The method also includes performing an operation on the set of elements. The method also includes creating an output vector by producing multiple instances of the operation's result. The method also includes reading a first write mask from a third register, the first write mask being different than the first read mask. The method also includes applying the write mask against the output vector to create a resultant vector. The method also includes writing the resultant vector to a destination register.
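The sequence of steps can be traced with a short sketch. Several details are assumptions, since the abstract leaves them open: the operation is taken to be a reduction supplied by the caller, masks are lists of 0/1 values with one entry per lane, and lanes not selected by the write mask keep the destination's previous contents. The function name `masked_op_broadcast` is hypothetical.

```python
from functools import reduce
from operator import add


def masked_op_broadcast(src, read_mask, write_mask, dest, op):
    """Apply a read mask, reduce, broadcast, then apply a write mask."""
    # Apply the read mask to pick the elements that take part in the operation.
    selected = [x for x, m in zip(src, read_mask) if m]
    # Perform the operation on the selected elements (needs at least one).
    result = reduce(op, selected)
    # Build the output vector from multiple instances of the result.
    output = [result] * len(dest)
    # Apply the write mask; unselected lanes keep the old destination value
    # (an assumed merging policy).
    return [o if m else d for o, m, d in zip(output, write_mask, dest)]


resultant = masked_op_broadcast(
    src=[1, 2, 3, 4],
    read_mask=[1, 0, 1, 1],
    write_mask=[0, 1, 1, 0],
    dest=[9, 9, 9, 9],
    op=add,
)
```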
Abstract:
An apparatus is described that includes a semiconductor chip having an instruction execution pipeline having one or more execution units with respective logic circuitry to: a) execute a first instruction that multiplies a first input operand and a second input operand and presents a lower portion of the result, where the first and second input operands are respective elements of first and second input vectors; b) execute a second instruction that multiplies a first input operand and a second input operand and presents an upper portion of the result, where the first and second input operands are respective elements of first and second input vectors; and c) execute an add instruction where a carry term of the add instruction's adding is recorded in a mask register.
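A software model of the three instructions might look like the sketch below. The 64-bit lane width, unsigned operands, the function names, and the encoding of the mask register as an integer with one carry bit per lane are all assumptions for illustration.

```python
BITS = 64                      # assumed lane width
LANE_MASK = (1 << BITS) - 1


def vmul_lo(a, b):
    """Per-lane multiply, keeping the lower BITS of each product."""
    return [(x * y) & LANE_MASK for x, y in zip(a, b)]


def vmul_hi(a, b):
    """Per-lane multiply, keeping the upper BITS of each product."""
    return [((x * y) >> BITS) & LANE_MASK for x, y in zip(a, b)]


def vadd_carry(a, b):
    """Per-lane add; returns (sums, carry_mask) with bit i of carry_mask
    set when lane i overflowed its BITS-wide result."""
    sums = []
    carry_mask = 0
    for lane, (x, y) in enumerate(zip(a, b)):
        total = x + y
        sums.append(total & LANE_MASK)
        if total >> BITS:
            carry_mask |= 1 << lane
    return sums, carry_mask
```

Splitting the product into separate low and high instructions, with carries exposed in a mask, is the kind of building block used for multi-word (large integer) arithmetic, where partial products must be summed across lanes.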
Abstract:
A processing core is described having execution unit logic circuitry having a first register to store a first vector input operand, a second register to store a second vector input operand and a third register to store a packed data structure containing scalar input operands a, b, c. The execution unit logic circuitry further includes a multiplier to perform the operation (a*(first vector input operand)) + (b*(second vector input operand)) + c.
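Element by element, the stated operation works out as in this minimal sketch. The function name and the representation of the packed data structure as a three-element tuple are assumptions.

```python
def scaled_vector_sum(v1, v2, packed_scalars):
    """Compute a*v1[i] + b*v2[i] + c for every lane i."""
    a, b, c = packed_scalars   # packed structure modelled as a tuple (a, b, c)
    return [a * x + b * y + c for x, y in zip(v1, v2)]


out = scaled_vector_sum([1, 2], [3, 4], (2, 3, 1))
```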
Abstract:
A microprocessor (10) comprises at least one general-purpose-register (12) arranged to store and provide a number of destination bits to a multiply unit (14); a control unit (18) adapted to provide at least a multiply-high instruction (20) and a multiply-high-and-accumulate instruction (22) to the multiply unit. The multiply unit is further arranged to receive at least a first and a second source operand (24, 26), each having an associated number of source bits, a sum of the associated numbers of source bits exceeding the number of destination bits; is connected to a register-extension cache (28) comprising at least one cache entry arranged to store and provide a number of precision-enhancement bits; and is adapted to store a destination portion of a result operand in the general-purpose-register and a precision-enhancement portion of the result operand in the cache entry. The result operand is generated by a multiply-high operation or by a multiply-high-and-accumulate operation, depending on the received instruction.
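One way to picture the split between destination bits and precision-enhancement bits is the sketch below. It assumes unsigned 32-bit source operands, a 32-bit destination register holding the upper half of the product, and a single cache entry holding the lower half; it also assumes that the accumulate variant adds the new product to the full-width value formed from both halves. None of these widths or behaviours is stated in the abstract, and the class name `MultiplyUnit` is hypothetical.

```python
DEST_BITS = 32                         # assumed destination width
DEST_MASK = (1 << DEST_BITS) - 1
FULL_MASK = (1 << (2 * DEST_BITS)) - 1


class MultiplyUnit:
    def __init__(self):
        self.gpr = 0         # destination portion (upper bits of the result)
        self.ext_cache = 0   # precision-enhancement portion (lower bits)

    def _store(self, full):
        """Split a full-width result between the register and the cache entry."""
        full &= FULL_MASK
        self.gpr = full >> DEST_BITS
        self.ext_cache = full & DEST_MASK

    def multiply_high(self, x, y):
        self._store(x * y)

    def multiply_high_accumulate(self, x, y):
        # Rebuild the full-width accumulator so the low bits are not lost.
        acc = (self.gpr << DEST_BITS) | self.ext_cache
        self._store(acc + x * y)


unit = MultiplyUnit()
unit.multiply_high(0x8000, 0x18000)             # product 0xC0000000: high half is 0
after_first = (unit.gpr, unit.ext_cache)
unit.multiply_high_accumulate(0x8000, 0x18000)  # low halves now carry into the high half
after_second = (unit.gpr, unit.ext_cache)
```

In this example each product alone has a zero upper half, so without the retained low bits the accumulated destination value would stay 0; with them, the second step yields 1 in the destination register.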