Abstract:
Embodiments of systems, apparatuses, and methods of performing in a computer processor dependency index vector calculation in response to an instruction that includes a first and second source writemask register operands, a destination vector register operand, and an opcode are described.
Abstract:
Methods and systems to convert a scalar computer program loop having loop-carried dependences into a vector computer program loop are disclosed. One such method includes, replacing the scalar recurrence operation in the scalar computer program loop with a first vector summing operation and a first vector recurrence operation. The first vector summing operation is to generate a first running sum and the first vector recurrence operation is to generate a first vector. In some examples, the first vector recurrence operation is based on the scalar recurrence operation. Disclosed methods also include inserting: 1) a renaming operation to rename the first vector, 2) a second vector summing operation that is to generate a second running sum; and 3) a second vector recurrence operation to generate a second vector based on the renamed first vector.
Abstract:
Methods and systems to convert a scalar computer program loop having loop-carried dependences into a vector computer program loop are disclosed. One such method includes, at runtime, identifying, by executing an instruction with one or more processors, a first loop iteration that cannot be executed in parallel with a second loop iteration due to a set of conflicting scalar loop operations. The first loop iteration is executed after the second loop iteration. The method also includes sectioning, by executing an instruction with one or more processors, a vector loop into vector partitions including a first vector partition. The first vector partition executes consecutive loop iterations in parallel and the consecutive loop iterations start at the second loop iteration and end before the first loop iteration.
Abstract:
A vector reduction instruction with non-unit strided access pattern is received and executed by the execution circuitry of a processor. In response to the instruction, the execution circuitry performs an associative reduction operation on data elements of a first vector register. Based on values of the mask register and a current element position being processed, the execution circuitry sequentially sets one or more data elements of the first vector register to a result, which is generated by the associative reduction operation applied to both a previous data element of the first vector register and a data clement of a third vector register. The previous data element is located more than one element position away from the current element position.
Abstract:
Loop vectorization methods and apparatus are disclosed. An example method includes prior to executing an original loop having iterations, analyzing, via a processor, the iterations of the original loop, identifying a dependency between a first one of the iterations of the original loop and a second one of the iterations of the original loop, after identifying the dependency, vectorizing a first group of the iterations of the original loop based on the identified dependency to form a vectorization loop, and setting a dynamic adjustment value of the vectorization loop based on the identified dependency.
Abstract:
Methods and systems to convert a scalar computer program loop having loop-carried dependences into a vector computer program loop are disclosed. One such method includes, replacing the scalar recurrence operation in the scalar computer program loop with a first vector summing operation and a first vector recurrence operation. The first vector summing operation is to generate a first running sum and the first vector recurrence operation is to generate a first vector. In some examples, the first vector recurrence operation is based on the scalar recurrence operation. Disclosed methods also include inserting: 1) a renaming operation to rename the first vector, 2) a second vector summing operation that is to generate a second running sum; and 3) a second vector recurrence operation to generate a second vector based on the renamed first vector.
Abstract:
Loop vectorization methods and apparatus are disclosed. An example method includes setting a dynamic adjustment value of a vectorization loop; executing the vectorization loop to vectorize a loop by grouping iterations of the loop into one or more vectors; identifying a dependency between iterations of the loop as; and setting the dynamic adjustment value based on the identified dependency.
Abstract:
Methods and systems to convert a scalar computer program loop having loop-carried dependences into a vector computer program loop are disclosed. One such method includes, at runtime, identifying, by executing an instruction with one or more processors, a first loop iteration that cannot be executed in parallel with a second loop iteration due to a set of conflicting scalar loop operations. The first loop iteration is executed after the second loop iteration. The method also includes sectioning, by executing an instruction with one or more processors, a vector loop into vector partitions including a first vector partition. The first vector partition executes consecutive loop iterations in parallel and the consecutive loop iterations start at the second loop iteration and end before the first loop iteration.
Abstract:
Embodiments of systems, apparatuses, and methods for performing in a computer processor generation of a predicate mask based on vector comparison in response to a single instruction are described.
Abstract:
Loop vectorization methods and apparatus are disclosed. An example method includes prior to executing an original loop having iterations, analyzing, via a processor, the iterations of the original loop, identifying a dependency between a first one of the iterations of the original loop and a second one of the iterations of the original loop, after identifying the dependency, vectorizing a first group of the iterations of the original loop based on the identified dependency to form a vectorization loop, and setting a dynamic adjustment value of the vectorization loop based on the identified dependency.