Abstract:
A processor includes an instruction fetch unit configured to issue instructions for execution, where the instructions are selected from a number of threads, where each given instruction has a corresponding thread identifier, and where at least some of the instructions specify operand(s) via register identifiers. A register file stores operands usable by the instructions, and may include several banks, each corresponding to a register identifier and including several entries corresponding to the several threads, wherein the entries are configured to store data values. In response to receiving a request to read a particular register identifier for a given thread identifier, the register file may be configured to decode the given thread identifier to retrieve entries from the banks that correspond to the given thread identifier. The register file may further select, from among the retrieved entries, a data value corresponding to the particular register identifier to be output.
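A minimal Python sketch of the read path described above, assuming one bank per register identifier with one entry per thread; the class name, register count, and thread count are illustrative, not taken from the abstract.

class BankedRegisterFile:
    """Behavioral model: one bank per register identifier, with one entry
    per thread inside each bank (sizes are illustrative)."""

    def __init__(self, num_registers=32, num_threads=8):
        # banks[reg_id][thread_id] holds one data value
        self.banks = [[0] * num_threads for _ in range(num_registers)]

    def write(self, reg_id, thread_id, value):
        self.banks[reg_id][thread_id] = value

    def read(self, reg_id, thread_id):
        # Decode the thread identifier: retrieve that thread's entry from
        # every bank, yielding one candidate per register identifier.
        candidates = [bank[thread_id] for bank in self.banks]
        # Select, from the retrieved entries, the value corresponding to
        # the requested register identifier.
        return candidates[reg_id]

rf = BankedRegisterFile()
rf.write(reg_id=5, thread_id=3, value=0xDEADBEEF)
assert rf.read(reg_id=5, thread_id=3) == 0xDEADBEEF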
Abstract:
Systems and methods for efficient execution of operations in a multi-threaded processor. Each thread may include a blocking instruction, which blocks other threads from utilizing hardware resources for an appreciable amount of time. One example of a blocking-type instruction is a Montgomery multiplication cryptographic instruction. Each thread can operate in a thread-based mode that allows the insertion of stall cycles during the execution of blocking instructions, during which other threads may utilize the previously blocked hardware resources. At times when multiple threads are scheduled to execute blocking instructions, the thread-based mode may be changed to increase throughput for these multiple threads. For example, the mode may be changed to disallow the insertion of stall cycles. Therefore, the time for sequential operation of the blocking instructions corresponding to the multiple threads may be reduced.
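A hedged Python sketch of the trade-off described above, using an invented one-cycle time slice and resume-overhead figure to show why disabling stall insertion can shorten back-to-back execution of several blocking instructions; the cost model is illustrative, not the patent's mechanism.

RESUME_OVERHEAD = 1  # hypothetical extra cycle to resume a preempted blocking op

def total_cycles(blocking_ops, insert_stalls):
    """Toy cost model for running one blocking instruction per thread on a
    shared unit.  blocking_ops is a list of per-instruction cycle counts.
    insert_stalls=True  -> thread-based mode: yield the unit after every
                           cycle so other threads can use it, paying
                           RESUME_OVERHEAD on each resume.
    insert_stalls=False -> run each blocking instruction to completion,
                           back to back."""
    if not insert_stalls:
        return sum(blocking_ops)
    return sum(op + (op - 1) * RESUME_OVERHEAD for op in blocking_ops)

ops = [20, 20, 20, 20]  # e.g. four threads each about to run a long multiply
print(total_cycles(ops, insert_stalls=True))   # 156 cycles with interleaving overhead
print(total_cycles(ops, insert_stalls=False))  # 80 cycles back to back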
Abstract:
A method and apparatus are disclosed for managing the execution of a floating-point store instruction within a data processing system including a memory and a superscalar processor having a number of floating-point registers (FPRs). According to the present invention, multiple instructions are dispatched for execution by the processor, including a floating-point store instruction having as an operand the content of a particular FPR. A determination is made whether the particular FPR is a destination register for results of a second instruction which precedes the store instruction in program order. If so, a determination is made whether the second instruction must complete before subsequent instructions can be successfully dispatched. In response to a determination that the second instruction must be completed prior to successfully dispatching subsequent instructions, the floating-point store instruction is cancelled and redispatched after the completion of the second instruction. In response to a determination that the second instruction need not be completed prior to successfully dispatching subsequent instructions, execution of the floating-point store instruction is initiated by computing the destination address within memory into which the operand of the floating-point store instruction is to be stored, thereby minimizing the delay in executing a floating-point store instruction.
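A minimal Python sketch of the dispatch decision described above; the record fields, function name, and action strings are invented for illustration.

from dataclasses import dataclass

@dataclass
class Instruction:
    dest_fpr: int | None          # FPR written by this instruction, if any
    must_complete_first: bool     # must finish before later dispatches succeed

def handle_fp_store(store_src_fpr, older_instructions):
    """Decide how to handle an FP store whose operand is FPR store_src_fpr.
    older_instructions are uncompleted instructions that precede the store
    in program order.  The returned action strings are illustrative."""
    for older in older_instructions:
        if older.dest_fpr == store_src_fpr:
            if older.must_complete_first:
                # Cancel now; redispatch after the older instruction completes.
                return "cancel_and_redispatch"
            # Dependence exists but dispatch may proceed: start the store by
            # computing its destination address in memory.
            return "compute_store_address"
    # No older producer of the operand: proceed normally.
    return "compute_store_address"

older = [Instruction(dest_fpr=4, must_complete_first=True)]
print(handle_fp_store(store_src_fpr=4, older_instructions=older))  # cancel_and_redispatch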
Abstract:
A processor may include a hardware instruction fetch unit configured to issue instructions for execution, and a hardware functional unit configured to receive instructions for execution, where the instructions include cryptographic instruction(s) and non-cryptographic instruction(s). The functional unit may include a cryptographic execution pipeline configured to execute the cryptographic instructions with a corresponding cryptographic execution latency, and a non-cryptographic execution pipeline configured to execute the non-cryptographic instructions with a corresponding non-cryptographic execution latency that is longer than the cryptographic execution latency. The functional unit may further include a local bypass network configured to bypass results produced by the cryptographic execution pipeline to dependent cryptographic instructions executing within the cryptographic execution pipeline, such that each instruction within a sequence of dependent cryptographic instructions is executable with the cryptographic execution latency, and where the results of the cryptographic execution pipeline are not bypassed to any other functional unit within the processor.
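A rough Python sketch of the local-bypass idea described above, with a stand-in operation in place of real cryptography; the point is only that a dependent cryptographic instruction reads its producer's result from the pipeline-local bypass rather than from the register file or another functional unit.

class CryptoPipeline:
    """Toy model of the local bypass network: results produced by the
    cryptographic pipeline are forwarded only to dependent cryptographic
    instructions inside the same pipeline, never to other functional units."""

    def __init__(self):
        self.local_bypass = {}   # destination register -> most recent crypto result

    def execute_crypto(self, dest_reg, src_regs, regfile):
        # Prefer locally bypassed values so a dependent crypto instruction
        # does not have to wait for its producer to update the register file.
        srcs = [self.local_bypass.get(r, regfile.get(r, 0)) for r in src_regs]
        result = sum(srcs) ^ 0xA5A5A5A5   # stand-in for a cryptographic round
        self.local_bypass[dest_reg] = result
        return result

pipe = CryptoPipeline()
regfile = {"r1": 7, "r2": 9}
pipe.execute_crypto("r3", ["r1", "r2"], regfile)
# The dependent instruction reads r3 from the local bypass, not the register file.
pipe.execute_crypto("r4", ["r3", "r1"], regfile)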
Abstract:
An apparatus and method to support pipelining of variable-latency instructions in a multithreaded processor. In one embodiment, a processor may include instruction fetch logic configured to issue a first and a second instruction from different ones of a plurality of threads during successive cycles. The processor may also include first and second execution units respectively configured to execute shorter-latency and longer-latency instructions and to respectively write shorter-latency or longer-latency instruction results to a result write port during a first or second writeback stage. The first writeback stage may occur fewer cycles after instruction issue than the second writeback stage. The instruction fetch logic may be further configured to guarantee result write port access by the second execution unit during the second writeback stage by preventing the shorter-latency instruction from issuing during a cycle for which the first writeback stage collides with the second writeback stage.
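A hedged Python sketch of the issue-gating rule described above, assuming illustrative issue-to-writeback latencies of 2 and 4 cycles and a single shared result write port.

SHORT_LATENCY = 2   # assumed cycles from issue to the first writeback stage
LONG_LATENCY = 4    # assumed cycles from issue to the second writeback stage

def can_issue_short(issue_cycle, pending_long_issues):
    """Return True if a shorter-latency instruction may issue this cycle.
    pending_long_issues holds the issue cycles of longer-latency
    instructions still in flight.  The shorter-latency instruction is held
    back whenever its writeback would land on the same cycle as a
    longer-latency writeback, guaranteeing the longer-latency instruction
    the shared result write port."""
    short_wb = issue_cycle + SHORT_LATENCY
    long_wbs = {c + LONG_LATENCY for c in pending_long_issues}
    return short_wb not in long_wbs

# A longer-latency instruction issued at cycle 10 writes back at cycle 14,
# so a shorter-latency instruction must not issue at cycle 12.
assert can_issue_short(12, pending_long_issues=[10]) is False
assert can_issue_short(13, pending_long_issues=[10]) is True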
Abstract:
An apparatus is provided, for use in a floating-point unit that calculates arithmetic functions, for handling special cases that fall outside normal floating-point arithmetic. The floating-point unit generates an exponent portion and a mantissa portion, and a writeback stage coupled to the exponent portion and the mantissa portion is specifically used to handle the special cases outside the normal floating-point arithmetic functions. A spill stage is also provided and is coupled to the writeback stage to receive a resultant exponent and mantissa. A register file unit is coupled to the writeback stage and the spill stage through a plurality of rename busses, which carry results between the writeback stage, the spill stage, and the register file. The spill stage is serially coupled to the writeback stage to provide a smooth transition when operating on the exponent and mantissa results from the writeback stage. Each rename bus has a pair of tri-state buffers, one coupling the rename bus to the writeback stage and the other coupling it to the spill stage. The instruction dispatcher also provides location information for directing the results from the writeback stage and the spill stage before the result is completed.
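A rough behavioral sketch, in Python, of the routing idea described above: a normal result completes in the writeback stage, while a special case is handed to the serially coupled spill stage, with exactly one stage driving a given rename bus at a time; the predicate and data format are assumptions.

def spill_stage_fixup(exponent, mantissa):
    # Placeholder for special-case handling (e.g. clamping an overflowed
    # exponent); the abstract does not specify the exact behavior.
    return (exponent, mantissa)

def route_result(exponent, mantissa, is_special_case):
    """Return (driving_stage, result).  A normal result completes in the
    writeback stage and drives its rename bus directly; a special case is
    handed to the serially coupled spill stage, which drives the bus
    instead.  Only one stage drives a given bus at a time, standing in for
    the pair of tri-state buffers on each rename bus."""
    if not is_special_case:
        return ("writeback_stage", (exponent, mantissa))
    return ("spill_stage", spill_stage_fixup(exponent, mantissa))

print(route_result(exponent=127, mantissa=0x400000, is_special_case=False))
print(route_result(exponent=255, mantissa=0x000000, is_special_case=True))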
Abstract:
Systems and methods for efficient instruction support of multiple features for opcodes of an instruction set. A processor detects that a fetched instruction of a computer program comprises an opcode corresponding to a plurality of functions, each function corresponding to a different type of operation. The processor determines that the received instruction corresponds to a feature requested by the computer program, such as a cryptographic algorithm. A determination is made as to whether hardware support exists for the feature. If hardware support exists for the feature, the instruction is executed on-chip by the hardware. Otherwise, software performs the operation corresponding to the instruction.
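A minimal Python sketch of the hardware-or-software dispatch described above; the feature names, capability set, and function names are hypothetical.

# Hypothetical capability set; in a real processor this would come from a
# hardware capability register, not a Python constant.
HW_SUPPORTED_FEATURES = {"aes", "sha256"}

def run_on_hardware(feature, operands):
    return f"hardware path for {feature}"

def run_in_software(feature, operands):
    return f"software path for {feature}"

def execute(opcode_feature, operands):
    """Dispatch an instruction whose opcode can name several functions,
    identified here by a feature string such as a cryptographic algorithm.
    Execute on-chip when hardware support exists, otherwise fall back to a
    software routine."""
    if opcode_feature in HW_SUPPORTED_FEATURES:
        return run_on_hardware(opcode_feature, operands)
    return run_in_software(opcode_feature, operands)

print(execute("aes", operands=[]))      # hardware path
print(execute("kasumi", operands=[]))   # software fallback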