Abstract:
A secondary jump execution unit (JEU) is incorporated in a microprocessor to operate concurrently with a primary JEU, enabling the execution of simultaneous branch operations and the possible detection of multiple branch mispredicts. When branch operations are executed on both JEUs in the same instruction cycle, mispredict processing for the secondary JEU is skidded into the primary JEU's dispatch pipeline, so that branch processing for the secondary JEU occurs after the primary JEU's branch has been processed and during a cycle in which the primary JEU is not processing a branch. Moreover, when a nuke command is also received from a reorder buffer of the processor, branch processing for the secondary JEU is further delayed to accommodate processing of the nuke on the primary JEU. Further embodiments support promoting the secondary JEU to have access to the mispredict mechanisms of the primary JEU in certain circumstances.
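The following is a minimal behavioral sketch, not the patented circuit: it only illustrates the ordering rule described above, in which a secondary-JEU mispredict is deferred ("skidded") and drained on a later cycle when the primary JEU has neither a branch nor a nuke to process. The event format and class name are assumptions for illustration.

from collections import deque

class DualJEUModel:
    def __init__(self):
        self.skid_queue = deque()   # deferred secondary-JEU mispredicts

    def cycle(self, primary_branch=None, secondary_branch=None, nuke=False):
        """Return the list of mispredict events processed this cycle."""
        processed = []
        if nuke:
            # A nuke from the reorder buffer occupies the primary pipeline;
            # any pending secondary mispredict waits another cycle.
            processed.append(("nuke", None))
        elif primary_branch is not None:
            if primary_branch["mispredicted"]:
                processed.append(("primary_mispredict", primary_branch["ip"]))
            # Primary pipeline busy: skidded mispredicts stay queued.
        elif self.skid_queue:
            # Primary JEU idle this cycle: drain one deferred mispredict.
            processed.append(("secondary_mispredict", self.skid_queue.popleft()))

        if secondary_branch is not None and secondary_branch["mispredicted"]:
            # Secondary JEU detected a mispredict; defer its processing.
            self.skid_queue.append(secondary_branch["ip"])
        return processed

model = DualJEUModel()
print(model.cycle(primary_branch={"ip": 0x10, "mispredicted": True},
                  secondary_branch={"ip": 0x20, "mispredicted": True}))
print(model.cycle(nuke=True))      # nuke delays the skidded mispredict again
print(model.cycle())               # idle cycle: secondary mispredict drains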
Abstract:
A scheduler implementing a dependency matrix having restricted entries is disclosed. A processing device of the disclosure includes a decode unit to decode an instruction and a scheduler communicably coupled to the decode unit. In one embodiment, the scheduler is configured to receive the decoded instruction, determine that the decoded instruction qualifies for allocation as a restricted reservation station (RS) entry type in a dependency matrix maintained by the scheduler, identify RS entries in the dependency matrix that are free for allocation, allocate one of the identified free RS entries in the dependency matrix with information of the decoded instruction, and update a row of the dependency matrix corresponding to the allocated RS entry with source dependency information of the decoded instruction.
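As a rough sketch of the data structure involved (not the patented scheduler), the toy model below keeps a dependency matrix in which a subset of RS entries is reserved for instructions that "qualify"; the qualification rule used here (at most one source dependency) is an assumption purely for illustration.

class MatrixScheduler:
    def __init__(self, num_entries=8, restricted=range(0, 4)):
        self.num_entries = num_entries
        self.restricted = set(restricted)
        self.valid = [False] * num_entries
        # matrix[row][col] == True means entry `row` depends on entry `col`.
        self.matrix = [[False] * num_entries for _ in range(num_entries)]
        self.uops = [None] * num_entries

    def allocate(self, uop, producers):
        qualifies = len(producers) <= 1            # assumed restriction rule
        pool = self.restricted if qualifies \
            else set(range(self.num_entries)) - self.restricted
        free = [e for e in sorted(pool) if not self.valid[e]]
        if not free:
            return None                            # no RS entry available
        entry = free[0]
        self.valid[entry] = True
        self.uops[entry] = uop
        for producer_entry in producers:           # record source dependencies
            self.matrix[entry][producer_entry] = True
        return entry

    def ready(self):
        # An entry is ready when no dependency bits remain set in its row.
        return [e for e in range(self.num_entries)
                if self.valid[e] and not any(self.matrix[e])]

    def complete(self, entry):
        # Clear the completed entry and the matching column bit of its dependents.
        self.valid[entry] = False
        for row in self.matrix:
            row[entry] = False

sched = MatrixScheduler()
p = sched.allocate("load r1", producers=[])
c = sched.allocate("add r2, r1", producers=[p])
print(sched.ready())        # only the load is ready
sched.complete(p)
print(sched.ready())        # now the dependent add is ready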
Abstract:
A processor includes a processing unit having a storage module that stores a table tracking the physical registers in which each store operation stores its source data, and a memory renaming module that performs register renaming of load operations based on the table.
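A behavioral sketch of that table, under assumptions (the address-keyed lookup and the names below are illustrative only): a store records the physical register holding its source data, and a later load that hits the table can be renamed to read that physical register directly instead of waiting on memory.

class MemoryRenamer:
    def __init__(self):
        self.store_table = {}       # store address -> physical source register

    def record_store(self, address, phys_src_reg):
        self.store_table[address] = phys_src_reg

    def rename_load(self, address):
        # Return the physical register to forward from, or None on a miss.
        return self.store_table.get(address)

renamer = MemoryRenamer()
renamer.record_store(0x1000, phys_src_reg=42)
print(renamer.rename_load(0x1000))   # 42: load forwards from the store's register
print(renamer.rename_load(0x2000))   # None: load proceeds through memory as usual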
Abstract:
A processor includes an execution unit to execute instructions, where each operand of each executed instruction has one or more elements of an element size and at least one operand of the instruction corresponds to a register of a register size. The processor further includes a counter configured to count the number of instructions executed by the execution unit that are associated with a particular combination of register size and element size.
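The bookkeeping described amounts to a counter keyed by (register size, element size). A minimal sketch, with an assumed event interface:

from collections import Counter

counters = Counter()

def on_instruction_executed(register_size_bits, element_size_bits):
    # Count one executed instruction for this (register size, element size) pair.
    counters[(register_size_bits, element_size_bits)] += 1

on_instruction_executed(256, 32)    # e.g. a 256-bit register of 32-bit elements
on_instruction_executed(256, 32)
on_instruction_executed(128, 64)
print(counters[(256, 32)])          # 2
print(counters[(128, 64)])          # 1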
Abstract:
Some implementations provide techniques and arrangements for adjusting a rate at which operations are performed by a processor based on a comparison of a first indication of power consumed by the processor as a result of performing a first set of operations and a second indication of power consumed by the processor as a result of performing a second set of operations. The rate at which operations are performed by the processor may be adjusted when the comparison indicates that a difference between the first indication of power consumed by the processor and the second indication of power consumed by the processor is greater than a threshold value.
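A sketch of the comparison described above, with assumed units and an assumed adjustment policy (the abstract does not specify the direction or magnitude of the adjustment): the rate changes only when the power difference between the two sets of operations exceeds a threshold.

def adjust_rate(current_rate, power_first, power_second,
                threshold=5.0, step=0.1):
    """Return a new operation rate based on the measured power difference."""
    difference = power_first - power_second
    if abs(difference) <= threshold:
        return current_rate                     # within threshold: no change
    if difference > 0:
        return current_rate * (1.0 - step)      # first set drew more power: slow down
    return current_rate * (1.0 + step)          # headroom available: speed up

print(adjust_rate(1.0, power_first=48.0, power_second=40.0))  # 0.9
print(adjust_rate(1.0, power_first=41.0, power_second=40.0))  # 1.0 (under threshold)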
Abstract:
A technique for using memory attributes to relay information to a program or other agent. More particularly, embodiments of the invention relate to using memory attribute bits to check various memory properties in an efficient manner.
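A hypothetical illustration of attribute bits that software or another agent can query cheaply; the bit assignments and block granularity below are assumptions, not the encoding of the disclosure.

ATTR_CHECKED   = 1 << 0    # block has been checked/initialized
ATTR_READ_ONLY = 1 << 1    # block should not be written
ATTR_MONITORED = 1 << 2    # accesses to the block are being observed

attributes = {}            # block address -> attribute bits

def set_attribute(block, bit):
    attributes[block] = attributes.get(block, 0) | bit

def has_attribute(block, bit):
    return bool(attributes.get(block, 0) & bit)

set_attribute(0x4000, ATTR_READ_ONLY)
print(has_attribute(0x4000, ATTR_READ_ONLY))   # True
print(has_attribute(0x4000, ATTR_MONITORED))   # False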
Abstract:
A processor method and apparatus that allow the overlapped execution of multiple iterations of a loop while requiring the compiler to include only a single copy of the loop body in the code, with the hardware automatically managing which iterations are active. Since the prologue and epilogue are implicitly created and maintained within the hardware in the invention, a significant reduction in code size can be achieved compared to software-only modulo scheduling. Furthermore, loops with iteration counts less than the number of concurrent iterations present in the kernel are also handled automatically. This hardware-enhanced scheme achieves the same performance as the fully specified standard method. The hardware also reduces the power requirement, as the entire fetch unit can be deactivated for a portion of the loop's execution. The basic design of the invention places, in the dispatch stage of a processor, a plurality of buffers for storing loop instructions, each of which is associated with an instruction decoder and its respective functional unit. Control logic receives loop setup parameters and controls the selective issue of instructions from the buffers to the functional units.
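A conceptual sketch of the staging behavior (the stage timing and buffer layout are assumptions, not the disclosed circuit): the single kernel copy is held in per-stage buffers, and a stage issues only when a live iteration occupies it, so the prologue and epilogue fall out of the activation test rather than being emitted by the compiler, and short trip counts are handled by the same test.

def run_modulo_loop(kernel_stages, trip_count):
    """kernel_stages: list of per-stage instruction buffers (stage 0 first).
    Issues stage s of iteration i at time i + s; returns the issue trace."""
    num_stages = len(kernel_stages)
    trace = []
    for t in range(trip_count + num_stages - 1):
        issued = []
        for s, buffer in enumerate(kernel_stages):
            iteration = t - s
            if 0 <= iteration < trip_count:    # stage active only for a live iteration
                issued.extend((insn, iteration) for insn in buffer)
        trace.append(issued)
    return trace

kernel = [["load"], ["mul"], ["store"]]        # one kernel copy, three stages
for cycle, ops in enumerate(run_modulo_loop(kernel, trip_count=2)):
    print(cycle, ops)
# Cycle 0 issues only the load (implicit prologue); the last cycle issues only
# the store (implicit epilogue); a trip count shorter than the number of
# stages is handled by the same activation test.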
Abstract:
A processing device that minimizes the bandwidth required to track return targets in an instruction tracing system is disclosed. A processing device of the disclosure includes an instruction fetch unit comprising a return stack buffer (RSB) to predict a target address of a return (RET) instruction corresponding to a call (CALL) instruction. The processing device further includes a retirement unit comprising an instruction tracing module to initiate instruction tracing for instructions executed by the processing device, determine whether the target address of the RET instruction was mispredicted, and determine a value of a call depth counter (CDC) maintained by the instruction tracing module. When the target address of the RET instruction was not mispredicted and the value of the CDC is greater than zero, the instruction tracing module generates an indication that the RET instruction branches to the next linear instruction after the corresponding CALL instruction.
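A sketch of the bandwidth-saving rule stated above, under assumptions (the packet names and the fallback behavior are illustrative, not the trace format of the disclosure): when a RET's target was correctly predicted and the call depth counter is positive, the trace needs only a compact "returns to the instruction after its CALL" indication instead of a full target address.

class ReturnTraceCompressor:
    def __init__(self):
        self.call_depth = 0        # CDC maintained by the tracing module

    def on_call(self):
        self.call_depth += 1

    def on_ret(self, mispredicted):
        if not mispredicted and self.call_depth > 0:
            self.call_depth -= 1
            return "RET_TO_NEXT_LINEAR"     # compact indication, no target address
        self.call_depth = max(0, self.call_depth - 1)
        return "FULL_TARGET_PACKET"         # fall back to emitting the full target

tracer = ReturnTraceCompressor()
tracer.on_call()
print(tracer.on_ret(mispredicted=False))    # RET_TO_NEXT_LINEAR
print(tracer.on_ret(mispredicted=False))    # CDC == 0: full target emitted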
Abstract:
Systems and methods for flag tracking in data manipulation operations involving move elimination. An example processing system comprises a first data structure including a plurality of physical register values; a second data structure including a plurality of pointers referencing elements of the first data structure; a third data structure including a plurality of move elimination sets, each move elimination set comprising two or more bits representing two or more logical data registers, the third data structure further comprising at least one bit associated with each move elimination set, the at least one bit representing one or more logical flag registers; a fourth data structure including an identifier of a data register sharing an element of the first data structure with a flag register; and a move elimination logic configured to perform a move elimination operation.
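A toy model of the bookkeeping enumerated above (the allocation scheme and register names are assumptions): pointers map logical registers to shared physical entries, each move elimination set records the data registers sharing an entry plus a bit for the flags register, and a separate field remembers which data register the flags are paired with.

class MoveEliminationTracker:
    def __init__(self):
        self.rat = {}               # second structure: logical reg -> physical entry id
        self.next_phys = 0          # stand-in for the physical register file (first structure)
        self.me_sets = []           # third structure: ({data regs sharing an entry}, flags_bit)
        self.flag_share_reg = None  # fourth structure: data reg sharing its entry with flags

    def _phys_of(self, reg):
        if reg not in self.rat:
            self.rat[reg] = self.next_phys
            self.next_phys += 1
        return self.rat[reg]

    def eliminate_move(self, dst, src, writes_flags=False):
        # The move is "executed" by copying a pointer instead of moving data.
        entry = self._phys_of(src)
        self.rat[dst] = entry
        self.me_sets.append(({dst, src}, writes_flags))
        if writes_flags:
            # The logical flags register shares the same physical entry as the data result.
            self.rat["FLAGS"] = entry
            self.flag_share_reg = dst

tracker = MoveEliminationTracker()
tracker.eliminate_move("rbx", "rax")
tracker.eliminate_move("rcx", "rdx", writes_flags=True)   # a flag-writing operation, assumed
print(tracker.rat)                            # rbx/rax share an entry; FLAGS shares rcx's
print(tracker.me_sets, tracker.flag_share_reg)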