摘要:
A result normalizer (58) for use with an adder (56) generates a mask in two stages that indicates the location of the leading one in the adder result. In the first stage, a leading zero anticipator (68) determines the position to within two digits. In the second stage, a count leading zero indicator (70) determines the position to a single digit. The mask is used to control the number of digits that each stage of a multiplexer array (66) shifts the adder result. The output of the multiplexer array thereby contains a leading one. The result normalizer may be advantageously used in high performance applications such as in a floating point execution unit in a data processor or in digital signal processing systems.
摘要:
A mechanism for handling unfused multiply-add accrued exception bits includes a processor including a floating point unit, a storage, and exception logic. The floating-point unit may be configured to execute an unfused multiply-accumulate instruction defined with the instruction set architecture (ISA). The unfused multiply-accumulate instruction may include a multiply sub-operation and an accumulate sub-operation. The storage may be configured to maintain floating-point exception state information. The exception logic may be configured to capture the floating-point exception state after completion of the multiply sub-operation and prior to completion of the accumulate sub-operation, for example, and to update the storage to reflect the floating-point exception state.
摘要:
A processing unit, system, and method for performing a multiply operation in a multiply-add pipeline. To reduce the pipeline latency, the unrounded result of a multiply-add operation is bypassed to the inputs of the multiply-add pipeline for use in a subsequent operation. If it is determined that rounding is required for the prior operation, then the rounding will occur during the subsequent operation. During the subsequent operation, a Booth encoder not utilized by the multiply operation will output a rounding correction factor as a selection input to a Booth multiplexer not utilized by the multiply operation. When the Booth multiplexer receives the rounding correction factor, the Booth multiplexer will output a rounding correction value to a carry save adder (CSA) tree, and the CSA tree will generate the correct sum from the rounding correction value and the other partial products.
摘要:
Techniques are disclosed relating to a processor including instruction support for performing a Montgomery multiplication. The processor may issue, for execution, programmer-selectable instruction from a defined instruction set architecture (ISA). The processor may include an instruction execution unit configured to receive instructions including a first instance of a Montgomery-multiply instruction defined within the ISA. The Montgomery-multiply instruction is executable by the processor to operate on at least operands A, B, and N residing in respective portions of a general-purpose register file of the processor, where at least one of operands A, B, N spans at least two registers of general-purpose register file. The instruction execution unit is configured to calculate P mod N in response to receiving the first instance of the Montgomery-multiply instruction, where P is the product of at least operand A, operand B, and R^−1.
摘要翻译:公开了涉及包括用于执行蒙哥马利乘法的指令支持的处理器的技术。 处理器可以从定义的指令集架构(ISA)发出执行编程器可选择的指令。 处理器可以包括指令执行单元,其被配置为接收包括在ISA内定义的蒙哥马利乘法指令的第一实例的指令。 蒙哥马利乘法指令可由处理器执行,以至少驻留在处理器的通用寄存器文件的相应部分中的操作数A,B和N操作,其中操作数A,B,N中的至少一个跨越 最少两个通用寄存器寄存器。 指令执行单元被配置为响应于接收到蒙哥马利乘法指令的第一实例来计算P mod N,其中P是至少操作数A,操作数B和R ^ -1的乘积。
摘要:
Systems and methods for efficient execution of operations in a multi-threaded processor. Each thread may include a blocking instruction. A blocking instruction blocks other threads from utilizing hardware resources for an appreciable amount of time. One example of a blocking type instruction is a Montgomery multiplication cryptographic instruction. Each thread can operate in a thread-based mode that allows the insertion of stall cycles during the execution of blocking instructions, during which other threads may utilize the previously blocked hardware resources. At times when multiple threads are scheduled to execute blocking instructions, the thread-based mode may be changed to increase throughput for these multiple threads. For example, the mode may be changed to disallow the insertion of stall cycles. Therefore, the time for sequential operation of the blocking instructions corresponding to the multiple threads may be reduced.
摘要:
Implementing an unfused multiply-add instruction within a fused multiply-add pipeline. The system may include an aligner having an input for receiving an addition term, a multiplier tree having two inputs for receiving a first value and a second value for multiplication, and a first carry save adder (CSA), wherein the first CSA may receive partial products from the multiplier tree and an aligned addition term from the aligner. The system may include a fused/unfused multiply add (FUMA) block which may receive the first partial product, the second partial product, and the aligned addition term, wherein the first partial product and the second partial product are not truncated. The FUMA block may perform an unfused multiply add operation or a fused multiply add operation using the first partial product, the second partial product, and the aligned addition term, e.g., depending on an opcode or mode bit.
摘要:
Systems and methods for efficient instruction support of an multiple features for opcodes of an instruction set. A processor detects a fetched instruction of a computer program comprises an opcode corresponding to a plurality of functions. Each function corresponds to a different type of operation. The processor determines the received instruction corresponds to a feature requested by the computer program, such as a cryptographic algorithm. A determination is made as to whether hardware support exists for the feature. If hardware support exists for the feature, the instruction is executed on-chip by the hardware. Otherwise, software performs the operation corresponding to the instruction.
摘要:
Techniques are disclosed relating to a processor including instruction support for performing a Montgomery multiplication. The processor may issue, for execution, programmer-selectable instruction from a defined instruction set architecture (ISA). The processor may include an instruction execution unit configured to receive instructions including a first instance of a Montgomery-multiply instruction defined within the ISA. The Montgomery-multiply instruction is executable by the processor to operate on at least operands A, B, and N residing in respective portions of a general-purpose register file of the processor, where at least one of operands A, B, N spans at least two registers of general-purpose register file. The instruction execution unit is configured to calculate P mod N in response to receiving the first instance of the Montgomery-multiply instruction, where P is the product of at least operand A, operand B, and R̂−1.
摘要翻译:公开了涉及包括用于执行蒙哥马利乘法的指令支持的处理器的技术。 处理器可以从定义的指令集架构(ISA)发出执行编程器可选择的指令。 处理器可以包括指令执行单元,其被配置为接收包括在ISA内定义的蒙哥马利乘法指令的第一实例的指令。 蒙哥马利乘法指令可由处理器执行,以至少驻留在处理器的通用寄存器文件的相应部分中的操作数A,B和N操作,其中操作数A,B,N中的至少一个跨越 最少两个通用寄存器寄存器。 指令执行单元被配置为响应于接收到蒙哥马利乘法指令的第一实例来计算P mod N,其中P是至少操作数A,操作数B和R-1的乘积。
摘要:
In one embodiment, a multithreaded processor includes a multithreaded instruction source that may provide a plurality of instructions each corresponding to a respective one of a plurality of threads. The multithreaded processor also includes a pick unit coupled to the multithreaded instruction source. The pick unit may select in a given cycle, a first divide instruction corresponding to one thread of the plurality of threads and a second divide instruction corresponding to another thread of the plurality of threads based upon a thread selection algorithm. Further, the multithreaded processor includes a storage coupled to a functional unit including a divider configured to execute the first divide instruction and the second divide instruction. The storage may store one of the first and the second divide instructions during execution of the other of the first and the second divide instructions.
摘要:
A processor including instruction support for implementing large-operand multiplication may issue, for execution, programmer-selectable instructions from a defined instruction set architecture (ISA). The processor may include an instruction execution unit comprising a hardware multiplier datapath circuit, where the hardware multiplier datapath circuit is configured to multiply operands having a maximum number of bits M. In response to receiving a single instance of a large-operand multiplication instruction defined within the ISA, wherein at least one of the operands of the large-operand multiplication instruction includes more than the maximum number of bits M, the instruction execution unit is configured to multiply operands of the large-operand multiplication instruction within the hardware multiplier datapath circuit to determine a result of the large-operand multiplication instruction without execution of programmer-selected instructions within the ISA other than the large-operand multiplication instruction.