Abstract:
Mechanisms for performing a complex matrix multiplication operation are provided. A vector load operation is performed to load a first vector operand of the complex matrix multiplication operation into a first target vector register. The first vector operand comprises the real and imaginary parts of a first complex vector value. A complex load and splat operation is performed to load a second complex vector value of a second vector operand and replicate the second complex vector value within a second target vector register. The second complex vector value has a real part and an imaginary part. A cross multiply add operation is performed on elements of the first target vector register and elements of the second target vector register to generate a partial product of the complex matrix multiplication operation. The partial product is accumulated with other partial products, and the resulting accumulated partial product is stored in a result vector register.
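For illustration only, the following C sketch models the three steps just described, the vector load, the complex load and splat, and the cross multiply add, with plain arrays standing in for vector registers. The four-element register width, the interleaved real/imaginary layout, and all function names are assumptions made for the sketch, not the claimed mechanism.

    #include <stdio.h>

    typedef struct { float e[4]; } vreg;   /* models a vector register holding two complex values */

    /* Vector load: real/imaginary pairs of two complex values of the first operand. */
    static vreg vec_load(const float *a) {
        vreg r = {{ a[0], a[1], a[2], a[3] }};     /* re0, im0, re1, im1 */
        return r;
    }

    /* Complex load and splat: one complex value of the second operand, replicated. */
    static vreg vec_load_splat_complex(const float *b) {
        vreg r = {{ b[0], b[1], b[0], b[1] }};     /* re, im, re, im */
        return r;
    }

    /* Cross multiply add: accumulate one complex partial product per pair of lanes. */
    static void vec_cross_madd(vreg *acc, vreg a, vreg b) {
        for (int k = 0; k < 4; k += 2) {
            acc->e[k]     += a.e[k] * b.e[k]     - a.e[k + 1] * b.e[k + 1];  /* real part      */
            acc->e[k + 1] += a.e[k] * b.e[k + 1] + a.e[k + 1] * b.e[k];      /* imaginary part */
        }
    }

    int main(void) {
        float a[4] = { 1, 2, 3, 4 };               /* two complex values of the first operand */
        float b[2] = { 5, 6 };                     /* one complex value of the second operand */
        vreg acc = {{ 0, 0, 0, 0 }};               /* result vector register (accumulator)    */
        vec_cross_madd(&acc, vec_load(a), vec_load_splat_complex(b));
        printf("%g%+gi  %g%+gi\n", acc.e[0], acc.e[1], acc.e[2], acc.e[3]);
        return 0;
    }

Repeating the cross multiply add for each complex value of the second operand accumulates further partial products into the result register, which corresponds to the accumulation step named in the abstract.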
Abstract:
A methodology and implementation of a load-tagged pointer instruction for a RISC-based microarchitecture are presented. A first, lower-latency, speculative implementation reduces overall throughput latency for a microprocessor system by estimating the result of a particular instruction and confirming the integrity of the estimate slightly later than the normal instruction execution latency. A second, higher-latency, non-speculative implementation that always produces correct results is invoked by the first when the first implementation's estimate is incorrect. The methodologies and structures disclosed herein are intended to be combined with predictive techniques for instruction processing to ultimately improve processing throughput.
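A minimal software model of the two-path pattern described above is sketched below in C; the tag rule (low address bits), the helper names, and the replay policy are assumptions, not the disclosed hardware design.

    #include <stdint.h>
    #include <stdbool.h>
    #include <stdio.h>

    typedef struct { uint64_t value; bool tag_ok; } lt_result;

    /* Assumed tag rule for the model: the tag is valid when the low bits are clear. */
    static bool tag_valid(uint64_t word) { return (word & 0x3) == 0; }

    /* Higher-latency, non-speculative path: always correct. */
    static lt_result ltptr_nonspeculative(const uint64_t *p) {
        lt_result r = { *p, tag_valid(*p) };
        return r;
    }

    /* Lower-latency, speculative path: the tag outcome is only a guess. */
    static lt_result ltptr_speculative(const uint64_t *p, bool predicted_ok) {
        lt_result r = { *p, predicted_ok };
        return r;
    }

    lt_result load_tagged_pointer(const uint64_t *p, bool predicted_ok) {
        lt_result fast = ltptr_speculative(p, predicted_ok);   /* result available early       */
        if (tag_valid(*p) != predicted_ok)                     /* confirmation arrives later   */
            return ltptr_nonspeculative(p);                    /* misprediction: replay slowly */
        return fast;                                           /* prediction confirmed         */
    }

    int main(void) {
        uint64_t word = 0x1000;                                /* low bits clear: tag is valid */
        lt_result r = load_tagged_pointer(&word, false);       /* deliberately mispredict      */
        printf("value=%#llx tag_ok=%d\n", (unsigned long long)r.value, (int)r.tag_ok);
        return 0;
    }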
Abstract:
A single register file may be addressed using both scalar and SIMD instructions. That is, according to the illustrative embodiments, subsets of registers within a multi-addressable register file are addressable with different instruction forms, e.g., scalar instructions, SIMD instructions, etc., while the entire set of registers may be addressed with yet another form of instructions, referred to herein as Vector-Scalar Extension (VSX) instructions. The operation set that may be performed on the entire set of registers using the VSX instruction form is substantially similar to the operation sets of the subsets of registers. Such an arrangement allows legacy instructions to access subsets of registers within the multi-addressable register file while new instructions, i.e., the VSX instructions, may access the entire range of registers within the multi-addressable register file.
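The addressing relationship can be pictured with a short C sketch. The 32/64 split below follows the published VSX register layout (floating-point registers overlay the first half of the unified file and vector registers the second half); the code itself is only an illustration, not the described hardware.

    #include <assert.h>

    enum { VSX_REGS = 64 };
    static double unified_file[VSX_REGS][2];    /* 64 registers, each modeled as two doubles */

    static double *fpr(int n) { assert(n >= 0 && n < 32); return unified_file[n];      }  /* scalar instruction form */
    static double *vr(int n)  { assert(n >= 0 && n < 32); return unified_file[32 + n]; }  /* SIMD instruction form   */
    static double *vsr(int n) { assert(n >= 0 && n < 64); return unified_file[n];      }  /* VSX instruction form    */

    int main(void) {
        fpr(1)[0] = 3.5;                /* legacy scalar write to register 1 ... */
        assert(vsr(1)[0] == 3.5);       /* ... is visible as VSX register 1      */
        vr(0)[1] = 2.0;                 /* legacy SIMD write to register 0 ...   */
        assert(vsr(32)[1] == 2.0);      /* ... is visible as VSX register 32     */
        return 0;
    }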
Abstract:
A design structure embodied in a machine readable medium used in a design process includes an apparatus for predictive decoding, the apparatus including register logic for fetching an instruction; predictor logic containing predictor information including prior instruction execution characteristics; logic for obtaining predictor information for the fetched instruction from the predictor; and decode logic for generating a selected one of a plurality of decode operation streams corresponding to the fetched instruction, wherein the decode operation stream is selected based on the predictor information.
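As a rough illustration of the selection step, the C sketch below keeps one prior-execution bit per predictor entry and uses it to pick between two decode operation streams; the table size, the recorded characteristic, and the two streams are assumptions, not elements of the claimed apparatus.

    #include <stdint.h>

    typedef enum { DECODE_SIMPLE, DECODE_CRACKED } decode_stream;

    typedef struct { uint8_t took_slow_path; } predictor_entry;   /* prior execution characteristic */

    static predictor_entry predictor[256];                        /* small direct-mapped predictor  */

    /* Decode logic: choose one of the decode operation streams for the fetched instruction. */
    decode_stream select_decode(uint64_t fetch_addr) {
        predictor_entry e = predictor[fetch_addr & 0xff];         /* predictor information */
        return e.took_slow_path ? DECODE_CRACKED : DECODE_SIMPLE;
    }

    /* After execution, record the observed characteristic for the next fetch of this address. */
    void update_predictor(uint64_t fetch_addr, int took_slow_path) {
        predictor[fetch_addr & 0xff].took_slow_path = (uint8_t)(took_slow_path != 0);
    }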
Abstract:
Copying characters of a set of terminated character data from one memory location to another memory location using parallel processing and without causing unwarranted exceptions. The character data to be copied is loaded into one or more vector registers. In particular, in one embodiment, an instruction (e.g., a Vector Load to Block Boundary instruction) is used that loads data in parallel into a vector register up to a specified boundary, and provides a way to determine the number of characters loaded. To determine the number of characters loaded (a count), another instruction (e.g., a Load Count to Block Boundary instruction) is used. Further, an instruction (e.g., a Vector Find Element Not Equal instruction) is used to find the index of the first delimiter character, i.e., the first termination character, such as a zero or null character, within the character data. This instruction checks a plurality of bytes of data in parallel.
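The interplay of the three instructions can be modeled in scalar C as below; the 16-byte block size and the helper names are assumptions, and the helpers only emulate the semantics the abstract describes rather than the actual hardware instructions.

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    enum { BLOCK = 16 };

    /* Models Load Count to Block Boundary: bytes available before the next block boundary. */
    static size_t lcbb(const char *p) {
        return BLOCK - ((size_t)(uintptr_t)p % BLOCK);
    }

    /* Models Vector Find Element Not Equal (zero search): index of the first 0 byte, or len. */
    static size_t find_zero(const char *buf, size_t len) {
        for (size_t i = 0; i < len; i++)
            if (buf[i] == 0) return i;
        return len;
    }

    /* strcpy-like copy that never reads across a block boundary past the terminator. */
    char *copy_terminated(char *dst, const char *src) {
        char *out = dst;
        for (;;) {
            size_t n = lcbb(src);             /* count of bytes that can be loaded safely */
            char block[BLOCK];
            memcpy(block, src, n);            /* models the Vector Load to Block Boundary */
            size_t z = find_zero(block, n);   /* first termination character, if loaded   */
            memcpy(out, block, z < n ? z + 1 : n);
            if (z < n) return dst;            /* terminator copied: done                  */
            out += n;
            src += n;
        }
    }

Because each iteration loads only up to the next block boundary, the copy never touches storage beyond the block containing the termination character, which is how the unwarranted exceptions mentioned above are avoided.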
Abstract:
The invention relates to implementing run-time instrumentation indirect sampling by address. An aspect of the invention includes a method for implementing run-time instrumentation indirect sampling by address. The method includes reading sample-point addresses from a sample-point address array, and comparing, by a processor, the sample-point addresses to an address associated with an instruction from an instruction stream executing on the processor. The method further includes recognizing a sample point upon execution of the instruction whose associated address matches one of the sample-point addresses. Run-time instrumentation information is obtained from the sample point. The method also includes storing the run-time instrumentation information in a run-time instrumentation program buffer as a reporting group.
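A simplified C model of the sample-point check is sketched below; the array contents, the buffer size, and the reporting-group layout are placeholders chosen for the sketch.

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    typedef struct { uint64_t instr_addr; uint64_t info; } reporting_group;

    static const uint64_t sample_point_addr[] = { 0x1000, 0x2040, 0x31c8 };  /* sample-point address array      */
    static reporting_group ri_buffer[1024];                                  /* run-time instrumentation buffer */
    static size_t ri_next;

    static bool is_sample_point(uint64_t instr_addr) {
        for (size_t i = 0; i < sizeof sample_point_addr / sizeof sample_point_addr[0]; i++)
            if (sample_point_addr[i] == instr_addr) return true;
        return false;
    }

    /* Called as each instruction of the stream executes. */
    void on_instruction(uint64_t instr_addr, uint64_t collected_info) {
        if (is_sample_point(instr_addr) && ri_next < 1024) {
            reporting_group g = { instr_addr, collected_info };   /* store as a reporting group */
            ri_buffer[ri_next++] = g;
        }
    }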
Abstract:
Embodiments of the invention relate to implementing run-time instrumentation indirect sampling by instruction operation code. An aspect of the invention includes reading sample-point instruction operation codes from a sample-point instruction array, and comparing, by a processor, the sample-point instruction operation codes to an operation code of an instruction from an instruction stream executing on the processor. A sample point is recognized upon execution of the instruction with the operation code matching one of the sample-point instruction operation codes. The run-time instrumentation information is obtained from the sample point. The run-time instrumentation information is stored in a run-time instrumentation program buffer as a reporting group.
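Only the match condition differs from the address-based sketch above: the comparison is against operation codes read from a sample-point instruction array. A minimal fragment (with placeholder opcode values) is:

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    static const uint16_t sample_point_opcode[] = { 0x0001, 0x00a7 };   /* placeholder opcodes */

    bool is_sample_point_opcode(uint16_t opcode) {
        for (size_t i = 0; i < sizeof sample_point_opcode / sizeof sample_point_opcode[0]; i++)
            if (sample_point_opcode[i] == opcode) return true;
        return false;
    }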
Abstract:
A method for performing predecode-time optimized instructions in conjunction with predecode-time optimized instruction sequence caching. The method includes receiving a first instruction of an instruction sequence and a second instruction of the instruction sequence and determining if the first instruction and the second instruction can be optimized. In response to determining that the first instruction and second instruction can be optimized, the method includes performing a predecode optimization on the instruction sequence and generating a new second instruction, wherein the new second instruction is not dependent on a target operand of the first instruction, and storing a predecoded first instruction and a predecoded new second instruction in an instruction cache. In response to determining that the first instruction and second instruction cannot be optimized, the method includes storing the predecoded first instruction and a predecoded second instruction in the instruction cache.
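The check-and-rewrite idea can be illustrated with a toy encoding in C: an add-immediate followed by a load that uses its result can be rewritten so the load takes the original base register and a combined displacement, removing the dependence on the first instruction's target operand. The encoding and the particular optimization rule are assumptions made for the sketch.

    #include <stdbool.h>
    #include <stdint.h>

    typedef struct { uint8_t op, rt, ra; int16_t imm; } insn;   /* simplified instruction encoding */
    enum { OP_ADDI = 1, OP_LOAD = 2 };

    /* Candidate pair: addi rt,ra,imm followed by a load whose base register is rt. */
    static bool can_optimize(insn i1, insn i2) {
        return i1.op == OP_ADDI && i2.op == OP_LOAD && i2.ra == i1.rt;
    }

    /* New second instruction: same load, but no longer dependent on i1's target. */
    static insn make_new_second(insn i1, insn i2) {
        insn n = i2;
        n.ra  = i1.ra;                        /* base comes from i1's source register */
        n.imm = (int16_t)(i1.imm + i2.imm);   /* displacements folded together        */
        return n;
    }

    /* Predecode: store either the optimized or the original pair in the instruction cache line. */
    void predecode_pair(insn i1, insn i2, insn icache_line[2]) {
        icache_line[0] = i1;
        icache_line[1] = can_optimize(i1, i2) ? make_new_second(i1, i2) : i2;
    }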
Abstract:
Embodiments relate to reducing a number of read ports for register pairs. An aspect includes executing an instruction. The instruction identifies a pair of registers as containing a wide operand which spans the pair of registers. It is determined if a pairing indicator associated with the pair of registers has a first value or a second value. The first value indicates that the wide operand is stored in a wide register, and the second value indicates that the wide operand is not stored in the wide register. Based on the pairing indicator having the first value, the wide operand is read from the wide register. Based on the pairing indicator having the second value, the wide operand is read from the pair of registers. An operation is performed using the wide operand.
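A small C sketch of the read-side decision follows; the register-file sizes and the per-pair indicator array are modeling assumptions.

    #include <stdbool.h>
    #include <stdint.h>

    typedef struct { uint64_t hi, lo; } wide128;

    static uint64_t gpr[16];     /* architected 64-bit registers               */
    static wide128  wide[16];    /* wider physical registers backing the pairs */
    static bool     paired[16];  /* pairing indicator, kept per even register  */

    /* Read a wide operand named by the even/odd register pair (r, r + 1); r must be even. */
    wide128 read_wide_operand(unsigned r) {
        if (paired[r]) {
            return wide[r];                        /* first value: operand is in the wide register, one read port */
        } else {
            wide128 w = { gpr[r], gpr[r + 1] };    /* second value: read both registers of the pair               */
            return w;
        }
    }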