SIMD implementation of stencil codes

    公开(公告)号:US09910826B2

    公开(公告)日:2018-03-06

    申请号:US14670684

    申请日:2015-03-27

    Abstract: Implementing a 1D stencil code via SIMD instructions on a computer with vector registers having N processing elements (PEs), among them a set of coefficient vector registers, a set of at most N data vector registers, and a set of result vector registers. The M stencil coefficients are loaded in a particular pattern into M+N−1 coefficient vector registers. Successive sets of N consecutive data values are received, and each data value of a set is loaded into all PEs of a data vector register of the set of data vector registers. The result vector registers accumulate sums of products of consecutive coefficient vector registers with corresponding data vector registers. The contents of any result vector register containing a sum of all coefficient vector register-data vector register products is output, and the result vector register is reused for accumulating.

    Instructions and Logic for Load-Indices-and-Prefetch-Gathers Operations

    公开(公告)号:US20170177349A1

    公开(公告)日:2017-06-22

    申请号:US14977356

    申请日:2015-12-21

    Abstract: A processor includes an execution unit to execute instructions to load indices from an array of indices, optionally perform a gather, and prefetch (to a specified cache) elements for a future gather from arbitrary locations in memory. The execution unit includes logic to load, for each element to be gathered or prefetched, an index value to be used in computing the address in memory for the element. The index value may be retrieved from an array of indices that is identified for the instruction. The execution unit includes logic to compute the address based on the sum of a base address that is specified for the instruction and the index value that was retrieved for the data element, with or without scaling. The execution unit includes logic to store gathered data elements in contiguous locations in a destination vector register that is specified for the instruction.

    MEMORY CONTROLLER AND SIMD PROCESSOR
    18.
    发明申请

    公开(公告)号:US20170147529A1

    公开(公告)日:2017-05-25

    申请号:US15423868

    申请日:2017-02-03

    Inventor: Shorin KYO

    CPC classification number: G06F15/8007 G06F15/8015

    Abstract: Technology to suppress the drop in SIMD processor efficiency that occurs when exchanging two-dimensional data in a plurality of rectangular regions, between an external section and a plurality of processor elements in an SIMD processor, so that one rectangular region corresponds to one processor element. In the SIMD processor, an address storage unit in a memory controller is capable of setting N number of addresses Ai (i=1 through N) in an external memory by utilizing a control processor. A parameter storage unit is capable of setting a first parameter OSV, a second parameter W, and a third parameter L by utilizing a control processor. A data transfer unit executes the transfer of data between an external memory, and the buffers in N number of processor elements contained in the applicable SIMD processor, based on the contents of the address storage unit and the parameter storage unit.

    Apparatus and method for efficient prefix sum operation

    公开(公告)号:US09632979B2

    公开(公告)日:2017-04-25

    申请号:US14727826

    申请日:2015-06-01

    Abstract: An apparatus and method are described for performing a prefix sum. For example, one embodiment of an apparatus comprises: a graphics processor unit comprising one or more execution units to execute single instruction multiple data (SIMD) instructions, the GPU to be provided with a plurality of data elements as input for a prefix sum operation; a first register of the GPU to store the plurality of data elements in specified data element positions; and the one or more execution units to perform a series of single instruction multiple data (SIMD) operations using the plurality of data elements, the SIMD operations performed using regioning techniques to generate the prefix sum, the SIMD operations including a first plurality of simultaneous addition operations to add specified data elements to generate intermediate results and further including a second plurality of simultaneous addition operations to add the intermediate results to other intermediate results to generate the prefix sum.

Patent Agency Ranking