Apparatuses, methods, and systems for a configurable accelerator having dataflow execution circuits

    公开(公告)号:US12086080B2

    公开(公告)日:2024-09-10

    申请号:US17033728

    申请日:2020-09-26

    CPC classification number: G06F13/1668 G06F13/4027

    Abstract: Systems, methods, and apparatuses relating to a configurable accelerator having dataflow execution circuits are described. In one embodiment, a hardware accelerator includes a plurality of dataflow execution circuits that each comprise a register file, a plurality of execution circuits, and a graph station circuit comprising a plurality of dataflow operation entries that each include a respective ready field that indicates when an input operand for a dataflow operation is available in the register file, and the graph station circuit is to select for execution a first dataflow operation entry when its input operands are available, and clear ready fields of the input operands in the first dataflow operation entry when a result of the execution is stored in the register file; a cross dependence network coupled between the plurality of dataflow execution circuits to send data between the plurality of dataflow execution circuits according to a second dataflow operation entry; and a memory execution interface coupled between the plurality of dataflow execution circuits and a cache bank to send data between the plurality of dataflow execution circuits and the cache bank according to a third dataflow operation entry.

    Apparatus and method for performing dual signed and unsigned multiplication of packed data elements

    公开(公告)号:US11809867B2

    公开(公告)日:2023-11-07

    申请号:US17027230

    申请日:2020-09-21

    Abstract: An apparatus and method for performing dual concurrent multiplications of packed data elements. For example one embodiment of a processor comprises: a decoder to decode a first instruction to generate a decoded instruction; a first source register to store a first plurality of packed byte data elements; a second source register to store a second plurality of packed byte data elements; execution circuitry to execute the decoded instruction, the execution circuitry comprising: multiplier circuitry to concurrently multiply each of the packed byte data elements of the first plurality with a corresponding packed byte data element of the second plurality to generate a plurality of products; adder circuitry to add specified sets of the products to generate temporary results for each set of products; zero-extension or sign-extension circuitry to zero-extend or sign-extend the temporary result for each set to generate an extended temporary result for each set; accumulation circuitry to combine each of the extended temporary results with a selected packed data value stored in a third source register to generate a plurality of final results; and a destination register to store the plurality of final results as a plurality of packed data elements in specified data element positions.

Patent Agency Ranking