PADDED VECTORIZATION WITH COMPILE TIME KNOWN MASKS

    公开(公告)号:US20200073662A1

    公开(公告)日:2020-03-05

    申请号:US16537460

    申请日:2019-08-09

    Abstract: A computing system includes a processing unit and a memory storing instructions that, when executed by the processor, cause the processor to receive program source code in a compiler, identify in the program source code a set of operations for vectorizing, where each operation in the set of operations specifies a set of one or more operands, in response to identifying the set of operations, vectorize the set of operations by, based on the number of operations in the set of operations and a total number of lanes in a first vector register, generating a mask indicating a first unmasked lane and a first masked lane in the first vector register, based on the mask, generating a set of one or more instructions for loading into the first unmasked lane a first operand of a first operation of the set of operations, and loading the first operand into the first masked lane.

    PROGRAM CODE OPTIMIZATION FOR REDUCING BRANCH MISPREDICTIONS

    公开(公告)号:US20180349140A1

    公开(公告)日:2018-12-06

    申请号:US15607883

    申请日:2017-05-30

    Abstract: Systems, apparatuses, and methods for implementing an IF2FOR transformation are disclosed. In one embodiment, a first group of instructions include an IF-statement and one or more control dependent instructions. The first group of instructions are transformed into a second group of instructions if the first group of instructions meet one or more criteria. In one embodiment, the criteria includes the (1) IF-statement being part of a loop and (2) the control dependent instructions not having any inter-loop iteration dependency. The second group of instructions are executable to (1) store results of the IF-statement condition for a first number of iterations and (2) execute the control dependent instructions for a second number of iterations when the IF-statement condition evaluates to true.

    Padded vectorization with compile time known masks

    公开(公告)号:US11789734B2

    公开(公告)日:2023-10-17

    申请号:US16537460

    申请日:2019-08-09

    Abstract: A computing system includes a processing unit and a memory storing instructions that, when executed by the processor, cause the processor to receive program source code in a compiler, identify in the program source code a set of operations for vectorizing, where each operation in the set of operations specifies a set of one or more operands, in response to identifying the set of operations, vectorize the set of operations by, based on the number of operations in the set of operations and a total number of lanes in a first vector register, generating a mask indicating a first unmasked lane and a first masked lane in the first vector register, based on the mask, generating a set of one or more instructions for loading into the first unmasked lane a first operand of a first operation of the set of operations, and loading the first operand into the first masked lane.

    Program code optimization for reducing branch mispredictions

    公开(公告)号:US10235173B2

    公开(公告)日:2019-03-19

    申请号:US15607883

    申请日:2017-05-30

    Abstract: Systems, apparatuses, and methods for implementing an IF2FOR transformation are disclosed. In one embodiment, a first group of instructions include an IF-statement and one or more control dependent instructions. The first group of instructions are transformed into a second group of instructions if the first group of instructions meet one or more criteria. In one embodiment, the criteria includes the (1) IF-statement being part of a loop and (2) the control dependent instructions not having any inter-loop iteration dependency. The second group of instructions are executable to (1) store results of the IF-statement condition for a first number of iterations and (2) execute the control dependent instructions for a second number of iterations when the IF-statement condition evaluates to true.

    Strided loading of non-sequential memory locations by skipping memory locations between consecutive loads

    公开(公告)号:US10353708B2

    公开(公告)日:2019-07-16

    申请号:US15273916

    申请日:2016-09-23

    Abstract: Systems, apparatuses, and methods for utilizing efficient vectorization techniques for operands in non-sequential memory locations are disclosed. A system includes a vector processing unit (VPU) and one or more memory devices. In response to determining that a plurality of vector operands are stored in non-sequential memory locations, the VPU performs a plurality of vector load operations to load the plurality of vector operands into a plurality of vector registers. Next, the VPU performs a shuffle operation to consolidate the plurality of vector operands from the plurality of vector registers into a single vector register. Then, the VPU performs a vector operation on the vector operands stored in the single vector register. The VPU can also perform a vector store operation by permuting and storing a plurality of vector operands in appropriate locations within multiple vector registers and then storing the vector registers to locations in memory using a mask.

Patent Agency Ranking