COMPILER OPERATIONS FOR TENSOR STREAMING PROCESSOR

    Publication No.: US20230359584A1

    Publication Date: 2023-11-09

    Application No.: US18351916

    Filing Date: 2023-07-13

    Applicant: Groq, Inc.

    IPC Classification: G06F15/82 G06N20/00

    CPC Classification: G06F15/825 G06N20/00

    Abstract: Embodiments are directed to a processor having a functional slice architecture. The processor is divided into tiles (or functional units) organized into a plurality of functional slices. The functional slices are configured to perform specific operations within the processor, which includes memory slices for storing operand data and arithmetic logic slices for performing operations on received operand data (e.g., vector processing, matrix manipulation). The processor includes a plurality of functional slices of a module type, each functional slice having a plurality of tiles. The processor further includes a plurality of data transport lanes for transporting data in a direction indicated in a corresponding instruction. The processor also includes a plurality of instruction queues, each instruction queue associated with a corresponding functional slice of the plurality of functional slices, wherein the instructions in the instruction queues comprise a functional slice specific operation code.

    POWER GRID DISTRIBUTION FOR TENSOR STREAMING PROCESSORS

    Publication No.: US20220365582A1

    Publication Date: 2022-11-17

    Application No.: US17732408

    Filing Date: 2022-04-28

    Applicant: Groq, Inc.

    Inventor: JEFFREY WERNER

    IPC Classification: G06F1/3206 G06N20/00 G06F9/38

    Abstract: Embodiments are directed to a power grid distribution for a deterministic processor. The deterministic processor includes a plurality of functional slices, a plurality of data transport lanes for transporting data across the functional slices along a first spatial dimension, and a plurality of instruction control units (ICUs). An instruction in each subset of the ICUs includes a functional slice specific operation code and is transported to a corresponding functional slice along a second spatial dimension orthogonal to the first spatial dimension. A power supply grid of metal traces is spread across the first and second spatial dimensions for supplying power to the functional slices and the ICUs. At least a portion of the metal traces are routed as discontinuous stubs along the first spatial dimension or the second spatial dimension.

    Tensor streaming processor architecture

    Publication No.: US11360934B1

    Publication Date: 2022-06-14

    Application No.: US17105976

    Filing Date: 2020-11-27

    Applicant: Groq, Inc.

    IPC Classification: G06F15/82 G06N20/00

    Abstract: Embodiments are directed to a processor having a functional slice architecture. The processor is divided into tiles (or functional units) organized into a plurality of functional slices. The functional slices are configured to perform specific operations within the processor, which includes memory slices for storing operand data and arithmetic logic slices for performing operations on received operand data (e.g., vector processing, matrix manipulation). The processor includes a plurality of functional slices of a module type, each functional slice having a plurality of tiles. The processor further includes a plurality of data transport lanes for transporting data in a direction indicated in a corresponding instruction. The processor also includes a plurality of instruction queues, each instruction queue associated with a corresponding functional slice of the plurality of functional slices, wherein the instructions in the instruction queues comprise a functional slice specific operation code.

    Systems and Methods for Numerical Precision in Digital Multiplier Circuitry

    Publication No.: US20220075598A1

    Publication Date: 2022-03-10

    Application No.: US17351044

    Filing Date: 2021-06-17

    Applicant: Groq, Inc.

    IPC Classification: G06F7/544 G06F7/487 G06F5/01

    Abstract: In one embodiment, multiplier circuitry multiplies operands of a first format. One or more storage register circuits store digital bits corresponding to an operand and another operand of the first format. A decomposing circuit decomposes the operand into a first plurality of operands, and the other operand into a second plurality of operands. Each multiplier circuit multiplies a respective first operand of the first plurality of operands with a respective second operand of the second plurality of operands to generate a corresponding partial result of a plurality of partial results. An accumulator circuit accumulates the plurality of partial results using a second format to generate a complete result of the second format that is stored in the accumulator circuit. A conversion circuit truncates the complete result of the second format and converts the truncated result into an output result of an output format.
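
The decompose, multiply, accumulate, and truncate flow in this abstract can be illustrated in software. The following is only a minimal sketch under stated assumptions, not the patented circuit: it assumes unsigned integer operands split into 8-bit chunks, and the names `split_operand`, `multiply_decomposed`, and `truncate_result` are hypothetical helpers introduced for illustration.

```python
def split_operand(x: int, chunk_bits: int = 8, chunks: int = 2) -> list[int]:
    """Decompose a non-negative integer into `chunks` pieces, least significant first."""
    mask = (1 << chunk_bits) - 1
    return [(x >> (i * chunk_bits)) & mask for i in range(chunks)]


def multiply_decomposed(a: int, b: int, chunk_bits: int = 8, chunks: int = 2) -> int:
    """Multiply two unsigned operands by accumulating shifted partial products."""
    a_parts = split_operand(a, chunk_bits, chunks)
    b_parts = split_operand(b, chunk_bits, chunks)
    # Assumption: Python's unbounded int stands in for the wider "second format".
    accumulator = 0
    for i, a_part in enumerate(a_parts):
        for j, b_part in enumerate(b_parts):
            partial = a_part * b_part  # one narrow multiply per pair of pieces
            accumulator += partial << ((i + j) * chunk_bits)
    return accumulator


def truncate_result(value: int, drop_bits: int) -> int:
    """Discard the lowest `drop_bits` bits of the accumulated result."""
    return value >> drop_bits


product = multiply_decomposed(0xABCD, 0x1234)
assert product == 0xABCD * 0x1234
```

Accumulating in a format wider than the inputs is what lets the partial results be summed without overflow before the final truncation step.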

    Processor compiler
    Granted invention patent

    Publication No.: US11216734B1

    Publication Date: 2022-01-04

    Application No.: US16526922

    Filing Date: 2019-07-30

    Applicant: Groq, Inc.

    IPC Classification: G06N5/02 G06N20/00

    Abstract: A system receives a predictive model and receives one or more runtime constraints. The system generates a directed acyclic graph (DAG) of the predictive model indicating dependencies. The system compiles the predictive model into first instructions for a first processor based on the one or more runtime constraints and the DAG. The system packages the first instructions, the one or more runtime constraints, and the DAG of the predictive model in a first binary. The system recompiles the predictive model into second instructions for a second processor based on the runtime constraints and the DAG stored in the first processor. The system packages the second instructions, the DAG, and the runtime constraints in a second binary.
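
The compile, package, and recompile flow described in this abstract can be sketched as follows. This is an illustrative sketch only, not Groq's compiler: the dictionary "binary", the instruction strings, and the names `compile_model` and `recompile` are hypothetical, and the DAG is assumed to map each operation to the set of operations it depends on.

```python
from graphlib import TopologicalSorter


def compile_model(dag: dict, constraints: dict, target: str) -> dict:
    """Order operations by dependency and package them with the DAG and constraints."""
    # static_order() yields each node only after all of its predecessors.
    order = list(TopologicalSorter(dag).static_order())
    instructions = [f"{target}:{op}" for op in order]  # placeholder "instructions"
    return {
        "instructions": instructions,
        "constraints": dict(constraints),
        "dag": {node: sorted(preds) for node, preds in dag.items()},
    }


def recompile(binary: dict, target: str) -> dict:
    """Rebuild instructions for another target from the DAG stored in a binary."""
    dag = {node: set(preds) for node, preds in binary["dag"].items()}
    return compile_model(dag, binary["constraints"], target)


# Hypothetical three-operation model: conv -> relu -> softmax.
model_dag = {"conv": set(), "relu": {"conv"}, "softmax": {"relu"}}
first = compile_model(model_dag, {"max_latency_ms": 5}, "chip_a")
second = recompile(first, "chip_b")
```

Keeping the DAG and constraints inside the packaged result is what allows the second compilation to run without access to the original model description.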

    Processor compiler
    Granted invention patent

    Publication No.: US11210594B1

    Publication Date: 2021-12-28

    Application No.: US16526916

    Filing Date: 2019-07-30

    Applicant: Groq, Inc.

    IPC Classification: G06N5/02 G06N20/00

    Abstract: A system receives a predictive model and receives one or more runtime constraints. The system generates a directed acyclic graph (DAG) of the predictive model indicating dependencies. The system compiles the predictive model into first instructions for a first processor based on the one or more runtime constraints and the DAG. The system packages the first instructions, the one or more runtime constraints, and the DAG of the predictive model in a first binary. The system recompiles the predictive model into second instructions for a second processor based on the runtime constraints and the DAG stored in the first processor. The system packages the second instructions, the DAG, and the runtime constraints in a second binary.

    Circuits and methods for updating lookup tables

    Publication No.: US11165428B1

    Publication Date: 2021-11-02

    Application No.: US16932632

    Filing Date: 2020-07-17

    Applicant: Groq, Inc.

    Abstract: The present disclosure provides circuits and methods that can be used to update configurations. An example circuit can include a plurality of hLUTs and a plurality of registers configured to propagate a set of data or a portion thereof to the plurality of hLUTs. An hLUT of the plurality of hLUTs can have a transformation unit comprising transformation circuitry configured to (i) receive the set of data or the portion thereof from a register of the plurality of registers and (ii) transform the set of data or the portion thereof into configurations for the hLUT.

    Multiplier circuitry for multiplying operands of multiple data types

    Publication No.: US11042360B1

    Publication Date: 2021-06-22

    Application No.: US16986007

    Filing Date: 2020-08-05

    Applicant: Groq, Inc.

    IPC Classification: G06F7/487 G06F7/544 G06F5/01

    Abstract: In one embodiment, in a first mode, first and second input operands having a first data type are multiplied using one or more of a plurality of multipliers, and in a second mode, a plurality of input operands having a second data type are multiplied using the plurality of multipliers. Accordingly, multiplier circuitry may process different input data types and share circuitry across the different modes. In some embodiments, in the first mode, products may be converted to a third data type, and in the second mode, multiple products may be concatenated. Values in the third data type, in the first mode, and concatenated values having the second data type, in the second mode, may be added across different multimodal multipliers to form a multiply-accumulator. In some embodiments, the plurality of multiply-accumulators may be configured in series.

    LOADING OPERANDS AND OUTPUTTING RESULTS FROM A MULTI-DIMENSIONAL ARRAY USING ONLY A SINGLE SIDE

    Publication No.: US20210157767A1

    Publication Date: 2021-05-27

    Application No.: US17104465

    Filing Date: 2020-11-25

    Applicant: Groq, Inc.

    IPC Classification: G06F15/80 G06F9/38 G06F9/30

    Abstract: A computational array is implemented in which all operands and results are loaded or output from a single side of the array. The computational array comprises a plurality of cells arranged in n rows and m columns, each configured to produce a processed value based upon a weight value and an activation value. The cells receive weight and activation values via colinear weight and activation transmission channels that each extend across a first side edge of the computational array to provide weight values and activation values to the cells of the array. In addition, result values produced at a top cell of each of the m columns of the array are routed through the array to be output from the same first side edge of the array at a same relative timing at which the result values were produced.
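
The arithmetic of such an n-by-m array can be sketched in a few lines. This is a minimal sketch under assumptions the abstract does not state: each cell's processed value is taken to be weight times activation, partial sums are assumed to accumulate from the bottom row up to the top cell of each column, and one activation is assumed per row. The function name `column_results` is hypothetical, and the single-side routing and timing behaviour are not modelled.

```python
def column_results(weights: list[list[int]], activations: list[int]) -> list[int]:
    """Return the value that reaches the top cell of each column.

    weights[row][col] is the weight held by one cell; activations[row] is the
    activation delivered to every cell in that row (an assumption of this sketch).
    """
    n_rows = len(weights)
    n_cols = len(weights[0])
    results = []
    for col in range(n_cols):
        partial = 0
        # Walk from the bottom cell to the top cell, adding each cell's product.
        for row in range(n_rows - 1, -1, -1):
            partial += weights[row][col] * activations[row]
        results.append(partial)
    return results


# 3 rows by 2 columns: each column's result is a dot product with the activations.
example = column_results([[1, 2], [3, 4], [5, 6]], [1, 0, 2])
```

Under these assumptions the m column results together form a vector-matrix product of the activations with the weight array.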