HETEROGENEOUS HARDWARE ACCELERATOR ARCHITECTURE FOR PROCESSING SPARSE MATRIX DATA WITH SKEWED NON-ZERO DISTRIBUTIONS

    Publication Number: US20180189239A1

    Publication Date: 2018-07-05

    Application Number: US15396513

    Filing Date: 2016-12-31

    CPC classification number: G06F17/16 G06F9/3001 G06F9/30036 H03M7/30

    Abstract: Heterogeneous hardware accelerator architectures for processing sparse matrix data having skewed non-zero distributions are described. An accelerator includes sparse tiles to access data from a first memory over a high-bandwidth interface and very/hyper-sparse tiles to randomly access data from a second memory over a low-latency interface. The accelerator determines that one or more computational tasks involving a matrix are to be performed and partitions the matrix into a first plurality of blocks that includes one or more sparse sections of the matrix and a second plurality of blocks that includes sections of the matrix that are very- or hyper-sparse. The accelerator causes the sparse tile(s) to perform one or more matrix operations for the computational task(s) using the first plurality of blocks and further causes the very/hyper-sparse tile(s) to perform the one or more matrix operations for the computational task(s) using the second plurality of blocks.
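
    As an illustration of the partitioning idea in this abstract, the following Python sketch splits a CSR matrix's rows into a denser partition destined for the sparse tiles and a very/hyper-sparse partition destined for the tiles with low-latency random access. The density threshold, the row-wise granularity, and the scipy-based representation are assumptions for illustration, not the method claimed in the patent.

    # Illustrative sketch: split a CSR matrix's rows into a denser "sparse"
    # partition (to be streamed to sparse tiles over a high-bandwidth interface)
    # and a "very/hyper-sparse" partition (to be randomly accessed by the other
    # tiles over a low-latency interface). Threshold and row granularity are
    # assumed for illustration.
    import numpy as np
    from scipy.sparse import csr_matrix, random as sparse_random

    def partition_by_density(A: csr_matrix, threshold: float = 0.01):
        """Return (sparse_rows, very_hyper_sparse_rows) row-index arrays."""
        nnz_per_row = np.diff(A.indptr)           # non-zeros in each row
        density = nnz_per_row / A.shape[1]        # fraction of columns populated
        sparse_rows = np.flatnonzero(density >= threshold)
        hyper_sparse_rows = np.flatnonzero(density < threshold)
        return sparse_rows, hyper_sparse_rows

    if __name__ == "__main__":
        A = sparse_random(1000, 1000, density=0.005, format="csr", random_state=0)
        dense_ish, skewed = partition_by_density(A)
        print(len(dense_ish), "rows for sparse tiles,", len(skewed), "rows for very/hyper-sparse tiles")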

    HARDWARE ACCELERATOR ARCHITECTURE FOR PROCESSING VERY-SPARSE AND HYPER-SPARSE MATRIX DATA

    Publication Number: US20180189234A1

    Publication Date: 2018-07-05

    Application Number: US15396511

    Filing Date: 2016-12-31

    Abstract: An accelerator architecture for processing very-sparse and hyper-sparse matrix data is disclosed. A hardware accelerator comprises one or more tiles, each including a plurality of processing elements (PEs) and a data management unit (DMU). The PEs are to perform matrix operations involving very- or hyper-sparse matrices that are stored by a memory. The DMU is to provide the plurality of PEs access to the memory via an interface that is optimized to provide low-latency, parallel, random accesses to the memory. The PEs, via the DMU, perform the matrix operations by issuing random-access read requests for values of the one or more matrices, issuing random-access read requests for values of one or more vectors serving as a second operand, and issuing random-access write requests for values of one or more vectors serving as a result.
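
    The access pattern described above can be sketched in software as a sparse matrix-vector multiply in which every matrix value, operand-vector value, and result value is touched through an explicit index, standing in for the random-access reads and writes issued through the DMU. The PE count and round-robin row assignment below are assumptions for illustration.

    # Illustrative sketch of the access pattern described above for y = A @ x with
    # a very/hyper-sparse CSR matrix: every matrix value, operand-vector value,
    # and result value is touched through an explicit index, modeling the
    # random-access reads/writes issued through the DMU. PE count and round-robin
    # row assignment are assumptions.
    import numpy as np
    from scipy.sparse import random as sparse_random

    def spmv_random_access(A, x, num_pes: int = 4):
        y = np.zeros(A.shape[0])
        for pe in range(num_pes):                       # rows shared round-robin (assumed)
            for row in range(pe, A.shape[0], num_pes):
                acc = 0.0
                for k in range(A.indptr[row], A.indptr[row + 1]):
                    col = A.indices[k]                  # read: matrix structure
                    acc += A.data[k] * x[col]           # reads: matrix value, operand vector
                y[row] = acc                            # write: result vector
        return y

    A = sparse_random(200, 200, density=0.001, format="csr", random_state=1)
    x = np.ones(200)
    assert np.allclose(spmv_random_access(A, x), A @ x)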

    COMPUTE ENGINE ARCHITECTURE TO SUPPORT DATA-PARALLEL LOOPS WITH REDUCTION OPERATIONS

    Publication Number: US20180189110A1

    Publication Date: 2018-07-05

    Application Number: US15396510

    Filing Date: 2016-12-31

    Abstract: Techniques involving a compute engine architecture to support data-parallel loops with reduction operations are described. In some embodiments, a hardware processor includes a memory unit and a plurality of processing elements (PEs). Each of the PEs is directly coupled via one or more neighbor-to-neighbor links with one or more neighboring PEs so that each PE can receive a value from a neighboring PE, provide a value to a neighboring PE, or both receive a value from one neighboring PE and also provide a value to another neighboring PE. The hardware processor also includes a control engine coupled with the plurality of PEs that is to cause the plurality of PEs to collectively perform a task to generate one or more output values by each performing one or more iterations of a same subtask of the task.
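
    A minimal software analogue of this arrangement is a chain of PEs that each run the same subtask on a slice of the loop's iterations and forward a running partial result to the next neighbor, so the reduction completes at the end of the chain. The PE count, the dot-product subtask, and the linear chain topology are illustrative assumptions.

    # Illustrative sketch: a data-parallel dot product with a reduction mapped onto
    # a chain of PEs linked neighbor-to-neighbor. Each PE performs the same subtask
    # on its slice of iterations and forwards the running partial result to the
    # next neighbor; only the last PE holds the final value.
    import numpy as np

    def chained_reduction(a: np.ndarray, b: np.ndarray, num_pes: int = 4) -> float:
        slices = np.array_split(np.arange(len(a)), num_pes)
        carried = 0.0                               # value received from the "left" neighbor
        for idx in slices:
            partial = float(np.dot(a[idx], b[idx])) # same subtask, different iterations
            carried += partial                      # combine and pass to the "right" neighbor
        return carried

    a = np.arange(16, dtype=float)
    b = np.ones(16)
    assert np.isclose(chained_reduction(a, b), a.sum())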

    HARDWARE ACCELERATOR TEMPLATE AND DESIGN FRAMEWORK FOR IMPLEMENTING RECURRENT NEURAL NETWORKS

    Publication Number: US20220121917A1

    Publication Date: 2022-04-21

    Application Number: US17563379

    Filing Date: 2021-12-28

    Abstract: Hardware accelerator templates and design frameworks for implementing recurrent neural networks (RNNs) and variants thereof are described. A design framework module obtains a flow graph for an RNN algorithm. The flow graph identifies operations to be performed to implement the RNN algorithm and further identifies data dependencies between ones of the operations. The operations include matrix operations and vector operations. The design framework module maps the operations of the flow graph to an accelerator hardware template, yielding an accelerator instance comprising register transfer language code that describes how one or more matrix processing units (MPUs) and one or more vector processing units (VPUs) are to be arranged to perform the RNN algorithm. At least one of the one or more MPUs, as part of implementing the RNN algorithm, is to directly provide or directly receive a value from one of the one or more VPUs.
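
    A rough idea of the flow graph such a framework would consume can be sketched for a vanilla RNN cell, h_t = tanh(W_x x_t + W_h h_{t-1} + b): each operation is tagged as a matrix operation (an MPU candidate) or a vector operation (a VPU candidate) and lists its data dependencies. The node names and the Python dataclass layout are assumptions for illustration only.

    # Illustrative sketch of a flow graph for a vanilla RNN cell,
    # h_t = tanh(W_x @ x_t + W_h @ h_{t-1} + b). Each node is tagged as a matrix
    # op (MPU candidate) or a vector op (VPU candidate) and lists its data
    # dependencies. Node names and dataclass layout are assumptions.
    from dataclasses import dataclass, field

    @dataclass
    class Op:
        name: str
        kind: str                   # "matrix" -> MPU, "vector" -> VPU
        deps: list = field(default_factory=list)

    rnn_cell_flow_graph = [
        Op("Wx_xt", "matrix"),                          # W_x @ x_t
        Op("Wh_ht1", "matrix"),                         # W_h @ h_{t-1}
        Op("sum", "vector", deps=["Wx_xt", "Wh_ht1"]),  # elementwise add
        Op("add_bias", "vector", deps=["sum"]),         # + b
        Op("tanh", "vector", deps=["add_bias"]),        # activation -> h_t
    ]

    for op in rnn_cell_flow_graph:
        unit = "MPU" if op.kind == "matrix" else "VPU"
        print(f"{op.name:>8} -> {unit}, depends on {op.deps}")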

    HARDWARE ACCELERATOR ARCHITECTURE AND TEMPLATE FOR WEB-SCALE K-MEANS CLUSTERING

    Publication Number: US20180189675A1

    Publication Date: 2018-07-05

    Application Number: US15396515

    Filing Date: 2016-12-31

    CPC classification number: G06N20/00 G06F16/2237 G06F16/285 G06F17/16

    Abstract: Hardware accelerator architectures for clustering are described. A hardware accelerator includes sparse tiles and very/hyper-sparse tiles. The sparse tile(s) execute operations for a clustering task involving a matrix. Each sparse tile includes a first plurality of processing units to operate upon a first plurality of blocks of the matrix that have been streamed to one or more random-access memories of the sparse tiles over a high-bandwidth interface from a first memory unit. Each of the very/hyper-sparse tiles is to execute operations for the clustering task involving the matrix. Each of the very/hyper-sparse tiles includes a second plurality of processing units to operate upon a second plurality of blocks of the matrix that have been randomly accessed over a low-latency interface from a second memory unit.
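
    To make the division of labour concrete, the sketch below runs the assignment step of k-means over two row partitions of a sparse data matrix, standing in for the blocks handled by the sparse tiles and by the very/hyper-sparse tiles; both partitions accumulate into the same centroid sums. The number of clusters, the density-based split, and the single-pass update are assumptions for illustration.

    # Illustrative sketch of one k-means assignment pass over a sparse data matrix
    # split into two row partitions, standing in for the blocks handled by the
    # sparse tiles and the very/hyper-sparse tiles. Both partitions feed the same
    # centroid accumulators.
    import numpy as np
    from scipy.sparse import random as sparse_random

    def assign_and_accumulate(rows, A, centroids, sums, counts):
        for r in rows:
            x = A.getrow(r).toarray().ravel()           # one example (matrix row)
            j = int(np.argmin(((centroids - x) ** 2).sum(axis=1)))
            sums[j] += x
            counts[j] += 1

    A = sparse_random(500, 64, density=0.02, format="csr", random_state=0)
    k = 8
    centroids = np.asarray(A[:k].todense())             # assumed initialization
    nnz = np.diff(A.indptr)
    sparse_rows = np.flatnonzero(nnz >= 2)              # partition for sparse tiles
    hyper_rows = np.flatnonzero(nnz < 2)                # partition for very/hyper-sparse tiles

    sums = np.zeros((k, 64))
    counts = np.zeros(k, dtype=int)
    assign_and_accumulate(sparse_rows, A, centroids, sums, counts)
    assign_and_accumulate(hyper_rows, A, centroids, sums, counts)
    new_centroids = sums / np.maximum(counts, 1)[:, None]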

    HARDWARE ACCELERATOR TEMPLATE AND DESIGN FRAMEWORK FOR IMPLEMENTING RECURRENT NEURAL NETWORKS

    Publication Number: US20180189638A1

    Publication Date: 2018-07-05

    Application Number: US15396520

    Filing Date: 2016-12-31

    CPC classification number: G06N3/063 G06N3/0445

    Abstract: Hardware accelerator templates and design frameworks for implementing recurrent neural networks (RNNs) and variants thereof are described. A design framework module obtains a flow graph for an RNN algorithm. The flow graph identifies operations to be performed to implement the RNN algorithm and further identifies data dependencies between ones of the operations. The operations include matrix operations and vector operations. The design framework module maps the operations of the flow graph to an accelerator hardware template, yielding an accelerator instance comprising register transfer language code that describes how one or more matrix processing units (MPUs) and one or more vector processing units (VPUs) are to be arranged to perform the RNN algorithm. At least one of the one or more MPUs, as part of implementing the RNN algorithm, is to directly provide or directly receive a value from one of the one or more VPUs.
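
    Complementing the flow-graph sketch given for the related publication above, the following sketch illustrates the mapping step: it walks a list of tagged operations and emits a toy textual description of which unit executes each operation and where a matrix-unit result feeds a vector unit directly. The tuple-based graph encoding and the textual output are stand-ins for illustration; they are not register transfer language.

    # Illustrative sketch of the mapping step: walk a tagged operation list (as in
    # the flow-graph sketch above) and emit a toy textual description of which unit
    # runs each op and where an MPU result feeds a VPU directly. The output is a
    # stand-in for illustration, not register transfer language.
    def map_to_template(ops):
        # ops: list of (name, kind, deps) tuples, kind in {"matrix", "vector"}
        unit_of = {name: ("MPU" if kind == "matrix" else "VPU") for name, kind, _ in ops}
        lines = []
        for name, _, deps in ops:
            lines.append(f"{unit_of[name]} executes {name}")
            for dep in deps:
                if unit_of[dep] == "MPU" and unit_of[name] == "VPU":
                    lines.append(f"  direct link: {dep} (MPU) -> {name} (VPU)")
        return "\n".join(lines)

    rnn_ops = [
        ("Wx_xt", "matrix", []),
        ("Wh_ht1", "matrix", []),
        ("sum", "vector", ["Wx_xt", "Wh_ht1"]),
        ("add_bias", "vector", ["sum"]),
        ("tanh", "vector", ["add_bias"]),
    ]
    print(map_to_template(rnn_ops))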
