Abstract:
Embodiments of systems, methods, and apparatuses for monitoring address conflicts are described. In some embodiments, an apparatus includes execution circuitry to execute instructions; a plurality of registers to store data coupled to the execution circuitry; and performance monitoring circuitry to perform address conflict counting by at least determining address conflicts between an executing instruction and previously executed instructions and counting each instance of a conflict.
Abstract:
A method includes processing a sequence of instructions of program code that are specified using one or more architectural registers, by a hardware -implemented pipeline that renames the architectural registers in the instructions so as to produce operations specified using one or more physical registers (50), At least first and second segments of the sequence of instructions are selected, wherein the second segment occurs later in the sequence than the first segment. One or more of the architectural registers in the instructions of the second segment are renamed, before completing renaming the architectural registers in the instructions of the first segment, by pre-allocating one or more of the physical registers to one or more of the architectural registers.
Abstract:
A method for instruction signature based (ISB) speculative optimization includes storing a plurality of entries. Each entry of the plurality of entries includes an instruction signature tag and an ISB predictor effectiveness measurement. The instruction signature tag corresponds to an instruction signature and the ISB predictor effectiveness measurement is based, least in part, on an effectiveness of a predictor when applied to the instruction signature. The method also includes detecting a to-be-executed instruction signature and determining if the plurality of entries includes a matching entry. The matching entry has an instruction signature tag corresponding to the to-be-executed instruction signature. Upon determining that the plurality of entries includes the matching entry, the method includes controlling an application of the predictor to the to-be-executed instruction signature, based at least in part on the ISB predictor effectiveness measurement in the matching entry.
Abstract:
Systems, apparatuses, and methods related to a block-based processor core topology register are disclosed. In one example of the disclosed technology, a processor can include a plurality of block-based processor cores for executing a program including a plurality of instruction blocks. A respective block-based processor core can include a sharable resource and a programmable composition topology register. The programmable composition topology register can be used to assign a group of the physical processor cores that share the sharable resource.
Abstract:
Apparatus and methods are disclosed for controlling execution of memory access instructions in a block-based processor architecture using a hardware structure that generates a relative ordering of memory access instruction in an instruction block. In one example of the disclosed technology, a method of executing an instruction block having a plurality of memory load and/or memory store instructions includes decoding an instruction block encoding a plurality of memory access instructions and generating data indicating a relative order for executing the memory access instructions in the instruction block and scheduling operation of a portion of the instruction block based at least in part on the relative order data. In some examples, a store vector data register can store the generated relative ordering data for use in subsequent instances of the instruction block.
Abstract:
Different processor architectures are described to evaluate and track dependencies required by instructions. The processors may hold or queue instructions that require output of other instructions until required data and resources are available which may remove the requirement of NOPs in the instruction memory to resolve dependencies and pipeline hazards. The processor may divide instruction data into bundles for parallel execution and provide speculative execution. The processor may include various components to implement an evaluation unit, execution unit and termination unit.
Abstract:
Storing narrow produced values for instruction operands directly in a register map in an out-of-order processor (OoP) is provided. An OoP is provided that includes an instruction processing system. The instruction processing system includes a number of instruction processing stages configured to pipeline the processing and execution of instructions according to a dataflow execution. The instruction processing system also includes a register map table (RMT) configured to store address pointers mapping logical registers to physical registers in a physical register file (PRF) for storing produced data for use by consumer instructions without overwriting logical registers for later executed, out-of-order instructions. In certain aspects, the instruction processing system is configured to write back (i.e., store) narrow values produced by executed instructions directly into the RMT, as opposed to writing the narrow produced values into the PRF in a write back stage.
Abstract:
Systems, methods, and computer-readable storage are disclosed for providing early access to target addresses in block-based processor architectures. In one example of the disclosed technology, a method of performing a branch in a block-based architecture can include executing one or more instructions of a first instruction block using a first core of the block-based architecture. The method can include, before the first instruction block is committed, initiating non-speculative execution of instructions of a second instruction block.
Abstract:
A self-timed parallelized multi-core processor and method for operating the processor are provided. The processor has an instruction decoder unit to receive a program code instruction, determine an operating code and latency for the program code instructions, and assign a loop index to the program code instruction. The processor further includes an instruction decomposer unit coupled to the instruction decoder unit, the instruction decomposer configured to create a primitive by decomposing the instruction, replace the loop index with a core index, and broadcast the primitive. The processor further has a plurality of self-timed processing cores coupled to the instruction decomposer unit, each core having a unique core index and having a dispatch unit for comparing the core index in the primitive with the core index of its processing core, each core acting on the primitive when the index of the processing core is within a threshold of the core index.