PROCESSOR ARCHITECTURE
    Invention Publication

    Publication No.: US20240176737A1

    Publication Date: 2024-05-30

    Application No.: US18394442

    Filing Date: 2023-12-22

    Applicant: Groq, Inc.

    Abstract: A processor having a functional slice architecture is divided into a plurality of functional units (“tiles”) organized into a plurality of slices. Each slice is configured to perform specific functions within the processor, which may include memory slices (MEM) for storing operand data, and arithmetic logic slices for performing operations on received operand data. The tiles of the processor are configured to stream operand data across a first dimension, and receive instructions across a second dimension orthogonal to the first dimension. The timing of data and instruction flows is configured such that corresponding data and instructions are received at each tile with a predetermined temporal relationship, allowing operand data to be transmitted between the slices of the processor without any accompanying metadata. Instead, each slice is able to determine what operations to perform on received data based upon the timing at which the data is received.
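    The timing-based dispatch described above can be illustrated with a small software model. The sketch below is a loose analogy, not Groq's implementation: each slice keeps a cycle-indexed schedule standing in for the predetermined temporal relationship and decides what to do with an arriving operand purely from the cycle count, with no metadata attached to the data. All names (Slice, MEM, ALU, run) are illustrative.

class Slice:
    """One functional slice; its schedule maps an arrival cycle to an operation,
    standing in for the predetermined data/instruction timing relationship."""

    def __init__(self, name, schedule):
        self.name = name
        self.schedule = schedule

    def process(self, cycle, operand):
        op = self.schedule.get(cycle)      # no metadata travels with the operand
        return operand if op is None else op(operand)

# Two slices: a memory-like slice that only forwards, and an ALU-like slice
# scheduled to double a value arriving on cycle 2 and increment one on cycle 3.
mem = Slice("MEM", {})
alu = Slice("ALU", {2: lambda x: x * 2, 3: lambda x: x + 1})

def run(stream):
    """Stream operands across the two slices, one hop per cycle."""
    results = []
    for cycle, value in enumerate(stream):
        value = mem.process(cycle, value)
        value = alu.process(cycle + 1, value)   # reaches the ALU one cycle later
        results.append(value)
    return results

print(run([10, 20, 30, 40]))   # -> [10, 40, 31, 40]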

    Processor instruction dispatch configuration

    Publication No.: US11868804B1

    Publication Date: 2024-01-09

    Application No.: US16951938

    Filing Date: 2020-11-18

    Applicant: Groq, Inc.

    Abstract: A processor comprises a computational array of computational elements and an instruction dispatch circuit. The computational elements receive data operands via data lanes extending along a first dimension, and process the operands based upon instructions received from the instruction dispatch circuit via instruction lanes extending along a second dimension. The instruction dispatch circuit receives raw instructions, and comprises an instruction dispatch unit (IDU) processor that processes a set of raw instructions to generate processed instructions for dispatch to the computational elements, where the number of processed instructions is not equal to the number of instructions in the set of raw instructions. The processed instructions are dispatched to columns of the computational array via a plurality of instruction queues, wherein an instruction vector is shifted between adjacent instruction queues in a first direction, and the queues dispatch instructions to the computational elements in a second direction.
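    As a rough illustration of this dispatch flow (a sketch under assumptions, not the patented circuit), the model below has an IDU step that expands macro-style raw instructions, so the processed count differs from the raw count, and a dispatch step that moves instructions across per-column queues in one direction before popping them toward the array in the other. The ('rep', op, n) encoding is made up for this example.

from collections import deque

def idu_process(raw_instructions):
    """IDU step: expand ('rep', op, n) macros; everything else passes through,
    so the processed count generally differs from the raw count."""
    processed = []
    for ins in raw_instructions:
        if ins[0] == "rep":
            processed.extend([ins[1]] * ins[2])
        else:
            processed.append(ins[0])
    return processed

def dispatch(processed, num_columns):
    """Distribute processed instructions round-robin across adjacent column
    queues (a stand-in for shifting in the first direction), then pop one
    instruction per column toward the array (the second direction)."""
    queues = [deque() for _ in range(num_columns)]
    for step, ins in enumerate(processed):
        queues[step % num_columns].append(ins)
    return [q.popleft() if q else "nop" for q in queues]

raw = [("rep", "mul", 3), ("add",), ("rep", "load", 2)]
processed = idu_process(raw)      # 6 processed instructions from 3 raw ones
print(processed)                  # ['mul', 'mul', 'mul', 'add', 'load', 'load']
print(dispatch(processed, 4))     # ['mul', 'mul', 'mul', 'add']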

    Memory design for a processor
    Invention Grant

    Publication No.: US11868250B1

    Publication Date: 2024-01-09

    Application No.: US17582895

    Filing Date: 2022-01-24

    Applicant: Groq, Inc.

    Abstract: A processor having a functional slice architecture is divided into a plurality of functional units (“tiles”) organized into a plurality of slices. Each slice is configured to perform specific functions within the processor, which may include memory slices (MEM) for storing operand data, and arithmetic logic slices for performing operations on received operand data. The tiles of the processor are configured to stream operand data across a first dimension, and receive instructions across a second dimension orthogonal to the first dimension. The timing of data and instruction flows is configured such that corresponding data and instructions are received at each tile with a predetermined temporal relationship, allowing operand data to be transmitted between the slices of the processor without any accompanying metadata. Instead, each slice is able to determine what operations to perform on received data based upon the timing at which the data is received.

    Spatial locality transform of matrices

    Publication No.: US11537687B2

    Publication Date: 2022-12-27

    Application No.: US16686870

    Filing Date: 2019-11-18

    Applicant: Groq, Inc.

    Abstract: A method comprises accessing a flattened input stream that includes a set of parallel vectors representing a set of input values of a kernel-sized tile of an input tensor that is to be convolved with a kernel. An expanded kernel is received that is generated by permuting values from the kernel. A control pattern is received that includes a set of vectors, each corresponding to an output value position for the kernel-sized tile of the output and indicating a vector of the flattened input stream from which to access input values. The method further comprises generating, for each output position of each kernel-sized tile of the output, a dot product between a first vector that includes values of the flattened input stream as selected by the control pattern, and a second vector in the expanded kernel corresponding to the output position.
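    For the 1-D case, this flow can be sketched as follows, assuming a circulant expanded kernel and a control pattern that picks, lane by lane, between the current kernel-sized tile and the next one. The helper names (expanded_kernel, control_pattern, conv1d_tiled) are hypothetical, not from the patent.

import numpy as np

def expanded_kernel(k):
    """Row p holds the kernel rotated so lane s carries k[(s - p) % K]."""
    K = len(k)
    return np.array([[k[(s - p) % K] for s in range(K)] for p in range(K)])

def control_pattern(K):
    """control[p][s] == 0 -> read lane s of the current tile,
    control[p][s] == 1 -> read lane s of the next tile."""
    return np.array([[0 if s >= p else 1 for s in range(K)] for p in range(K)])

def conv1d_tiled(x, k):
    """Sliding dot products of x with k, computed one kernel-sized tile at a
    time via a control-pattern lane select and one dot product per output."""
    K = len(k)
    ek, ctrl = expanded_kernel(k), control_pattern(K)
    pad = (-len(x)) % K + K                        # pad to a tile boundary plus one zero tile
    x = np.concatenate([x, np.zeros(pad)])
    out = []
    for t in range(0, len(x) - K, K):
        cur, nxt = x[t:t + K], x[t + K:t + 2 * K]  # two parallel input vectors
        for p in range(K):
            lanes = np.where(ctrl[p] == 0, cur, nxt)
            out.append(float(lanes @ ek[p]))       # one dot product per output position
    return np.array(out)

x = np.arange(8, dtype=float)
k = np.array([1.0, 2.0, 3.0])
print(conv1d_tiled(x, k)[:6])   # -> [ 8. 14. 20. 26. 32. 38.]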

    Multichip timing synchronization circuits and methods

    Publication No.: US11474557B2

    Publication Date: 2022-10-18

    Application No.: US17021746

    Filing Date: 2020-09-15

    Applicant: Groq, Inc.

    IPC Classes: G06F1/12 G06N5/02

    Abstract: In one embodiment, the present disclosure includes multichip timing synchronization circuits and methods. In one embodiment, hardware counters in different systems are synchronized. Programs on the systems may include synchronization instructions. A second system executes a synchronization instruction and, in response, synchronizes a local software counter to a local hardware counter. The software counter on the second system may be delayed by a fixed period of time corresponding to a program delay on the first system. The software counter on the second system may further be delayed by an offset to bring the software counters on the two systems into sync.
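    A toy model of this counter scheme (assumptions only, not the disclosed circuit): each chip carries a free-running hardware counter, and the second chip's synchronization instruction copies its hardware counter into a software counter and then subtracts a fixed program delay plus an offset so the two software counters line up. The class and the numeric values are illustrative.

class Chip:
    def __init__(self, hw_start):
        self.hw_counter = hw_start     # free-running hardware counter
        self.sw_counter = None         # program-visible software counter

    def tick(self, cycles=1):
        self.hw_counter += cycles

    def execute_sync(self, program_delay=0, offset=0):
        # Synchronize the software counter to the local hardware counter, then
        # delay it by the first chip's known program delay plus an offset that
        # accounts for the skew between the two hardware counters.
        self.sw_counter = self.hw_counter - program_delay - offset

chip0 = Chip(hw_start=1000)
chip1 = Chip(hw_start=1234)            # started later, so its counter is ahead

# Chip 0 reaches its sync point first; chip 1 reaches it 7 cycles later
# (the fixed program delay), and the hardware counters differ by 234.
chip0.execute_sync()
chip1.tick(7)
chip1.execute_sync(program_delay=7, offset=234)

print(chip0.sw_counter, chip1.sw_counter)   # -> 1000 1000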

    Tiled Switch Matrix Data Permutation Circuit

    Publication No.: US20220236954A1

    Publication Date: 2022-07-28

    Application No.: US17717629

    Filing Date: 2022-04-11

    Applicant: Groq, Inc.

    IPC Classes: G06F7/76 G06F7/78

    Abstract: Embodiments of the present disclosure pertain to a switch matrix circuit including a data permutation circuit. In one embodiment, the switch matrix comprises a plurality of adjacent switching blocks configured along a first axis, wherein the plurality of adjacent switching blocks each receive data and switch control settings along a second axis. The switch matrix includes a permutation circuit comprising, in each switching block, a plurality of switching stages spanning a plurality of adjacent switching blocks and at least one switching stage that does not span adjacent switching blocks. The permutation circuit receives data in a first pattern and outputs the data in a second pattern. The data permutation performed by the switching stages is based on the particular switch control settings received in the adjacent switching blocks along the second axis.
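    The staged permutation can be mimicked in software as below. This is a loose model, not the circuit: one stage operates inside each switching block and another spans adjacent blocks, and the resulting permutation depends only on the per-block control settings delivered along the second axis. The stage behaviors (reverse within a block, swap adjacent blocks) are arbitrary choices for illustration.

def stage_within_block(data, controls, block_size):
    """Per-block stage: if a block's control bit is set, reverse that block."""
    out = list(data)
    for b, ctrl in enumerate(controls):
        lo = b * block_size
        if ctrl:
            out[lo:lo + block_size] = reversed(out[lo:lo + block_size])
    return out

def stage_across_blocks(data, controls, block_size):
    """Cross-block stage: if a block's control bit is set, swap that block's
    contents with the next adjacent block (spanning two switching blocks)."""
    out = list(data)
    for b, ctrl in enumerate(controls[:-1]):
        if ctrl:
            lo, mid, hi = b * block_size, (b + 1) * block_size, (b + 2) * block_size
            out[lo:mid], out[mid:hi] = out[mid:hi], out[lo:mid]
    return out

data = list(range(8))                  # first pattern, 4 blocks of 2 lanes
within_ctrl = [1, 0, 1, 0]             # control settings arriving per block
across_ctrl = [1, 0, 0, 0]

permuted = stage_across_blocks(
    stage_within_block(data, within_ctrl, block_size=2), across_ctrl, block_size=2)
print(permuted)                        # second pattern: [2, 3, 1, 0, 5, 4, 6, 7]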

    DATA STRUCTURES WITH MULTIPLE READ PORTS

    Publication No.: US20220101896A1

    Publication Date: 2022-03-31

    Application No.: US17397158

    Filing Date: 2021-08-09

    Applicant: Groq, Inc.

    Abstract: A memory structure having 2^m read ports allowing for concurrent access to n data entries can be constructed using three memory structures each having 2^(m-1) read ports. The three memory structures include two structures providing access to the two halves of the n data entries, and a difference structure providing access to difference data between the halves of the n data entries. Each pair of the 2^m ports is connected to a respective port of each of the 2^(m-1)-port data structures, such that each port of the pair can access data entries of a first half of the n data entries either by accessing the structure storing that half directly, or by accessing both the difference structure and the structure containing the second half to reconstruct the data entries of the first half, thus allowing a pair of ports to concurrently access any of the stored data entries in parallel.
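    A minimal sketch of the two-port case (m = 1), under the common assumption that the difference data is a bitwise XOR of corresponding entries from the two halves; the class and method names are hypothetical. One port reads its half directly while the other reconstructs the requested entry from the XOR structure and the opposite half, so both ports can target the same half in the same cycle.

class TwoPortRead:
    def __init__(self, entries):
        assert len(entries) % 2 == 0
        half = len(entries) // 2
        self.a = entries[:half]                              # first half of the n entries
        self.b = entries[half:]                              # second half
        self.diff = [x ^ y for x, y in zip(self.a, self.b)]  # difference structure
        self.half = half

    def read_pair(self, addr0, addr1):
        """Serve two reads per cycle even when both fall in the same half:
        port 0 reads its half directly; port 1 reconstructs via the XOR."""
        def direct(addr):
            return self.a[addr] if addr < self.half else self.b[addr - self.half]

        def reconstruct(addr):
            if addr < self.half:                             # rebuild A from D and B
                return self.diff[addr] ^ self.b[addr]
            return self.diff[addr - self.half] ^ self.a[addr - self.half]

        return direct(addr0), reconstruct(addr1)

mem = TwoPortRead([7, 13, 42, 9])      # halves: [7, 13] and [42, 9]
print(mem.read_pair(0, 1))             # both addresses in the first half -> (7, 13)

    In this sketch each of the three underlying structures is read at most once per pair of reads, which is what lets two single-ported halves plus a difference structure behave like a two-port memory.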

    Processor architecture
    Invention Grant

    Publication No.: US11243880B1

    Publication Date: 2022-02-08

    Application No.: US16132243

    Filing Date: 2018-09-14

    Applicant: Groq, Inc.

    IPC Classes: G06F12/02 G06F3/06

    Abstract: A processor having a functional slice architecture is divided into a plurality of functional units (“tiles”) organized into a plurality of slices. Each slice is configured to perform specific functions within the processor, which may include memory slices (MEM) for storing operand data, and arithmetic logic slices for performing operations on received operand data. The tiles of the processor are configured to stream operand data across a first dimension, and receive instructions across a second dimension orthogonal to the first dimension. The timing of data and instruction flows is configured such that corresponding data and instructions are received at each tile with a predetermined temporal relationship, allowing operand data to be transmitted between the slices of the processor without any accompanying metadata. Instead, each slice is able to determine what operations to perform on received data based upon the timing at which the data is received.

    Expanded kernel generation
    Invention Grant

    Publication No.: US11204976B2

    Publication Date: 2021-12-21

    Application No.: US16686864

    Filing Date: 2019-11-18

    Applicant: Groq, Inc.

    Abstract: A method comprises receiving a kernel used to convolve with an input tensor. For the first dimension of the kernel, a square block of values is generated for each single-dimensional vector of the kernel, the block including all rotations of that single-dimensional vector. For each additional dimension of the kernel, blocks of the immediately preceding dimension are grouped into sets of blocks, each set including blocks of the immediately preceding dimension that are aligned along a vector parallel to the axis of that dimension; and, for the additional dimension, one or more blocks of values are generated, each block including all rotations of the blocks within each of the sets of blocks of the immediately preceding dimension. The block of values corresponding to the last of the additional dimensions of the kernel is output as the expanded kernel.
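    The nested-rotation construction can be sketched for a 2-D kernel as follows (an illustrative reading of the abstract, not the patented procedure): the first dimension yields a block of all rotations of each kernel row, and the second dimension arranges all rotations of those row blocks into a block-circulant expanded kernel. The function names are made up for the example.

import numpy as np

def rotations(vec):
    """Square block containing all rotations of a 1-D vector."""
    n = len(vec)
    return np.array([np.roll(vec, r) for r in range(n)])

def expanded_kernel_2d(kernel):
    """Arrange all rotations of the per-row blocks into the expanded kernel."""
    row_blocks = [rotations(row) for row in kernel]        # first dimension
    rows, cols = kernel.shape
    out = np.zeros((rows * cols, rows * cols))
    for r in range(rows):                                  # second dimension:
        for b in range(rows):                              # all rotations of the
            out[r * cols:(r + 1) * cols,                   # row blocks themselves
                b * cols:(b + 1) * cols] = row_blocks[(b - r) % rows]
    return out

k = np.array([[1., 2.],
              [3., 4.]])
print(expanded_kernel_2d(k))
# -> [[1. 2. 3. 4.]
#     [2. 1. 4. 3.]
#     [3. 4. 1. 2.]
#     [4. 3. 2. 1.]]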