PERMUTING IN A MATRIX-VECTOR PROCESSOR
    63.
    发明公开

    公开(公告)号:US20240211534A1

    公开(公告)日:2024-06-27

    申请号:US18241805

    申请日:2023-09-01

    Applicant: Google LLC

    Abstract: A circuit comprises an input register configured to receive an input vector of elements, a control register configured to receive a control vector of elements, wherein each element of the control vector corresponds to a respective element of the input vector, and wherein each element specifies a permutation of a corresponding element of the input vector, and a permute execution circuit configured to generate an output vector of elements corresponding to a permutation of the input vector. Generating each element of the output vector comprises accessing, at the input register, a particular element of the input vector, accessing, at the control register, a particular element of the control vector corresponding to the particular element of the input vector, and outputting the particular element of the input vector as an element at a particular position of the output vector that is selected based on the particular element of the control vector.

    VECTOR REDUCTION PROCESSOR
    64.
    发明公开

    公开(公告)号:US20240168914A1

    公开(公告)日:2024-05-23

    申请号:US18429142

    申请日:2024-01-31

    Applicant: Google LLC

    Abstract: A vector reduction circuit configured to reduce an input vector of elements comprises a plurality of cells, wherein each of the plurality of cells other than a designated first cell that receives a designated first element of the input vector is configured to receive a particular element of the input vector, receive, from another of the one or more cells, a temporary reduction element, perform a reduction operation using the particular element and the temporary reduction element, and provide, as a new temporary reduction element, a result of performing the reduction operation using the particular element and the temporary reduction element. The vector reduction circuit also comprises an output circuit configured to provide, for output as a reduction of the input vector, a new temporary reduction element corresponding to a result of performing the reduction operation using a last element of the input vector.

    Low latency matrix multiply unit
    65.
    发明授权

    公开(公告)号:US11989259B2

    公开(公告)日:2024-05-21

    申请号:US17985069

    申请日:2022-11-10

    Applicant: Google LLC

    Abstract: Methods, systems, and apparatus for a matrix multiply unit implemented as a systolic array of cells are disclosed. The matrix multiply unit may include cells arranged in columns of the systolic array. Two chains of weight shift registers per column of the systolic array are in the matrix multiply unit. Each weight shift register is connected to only one chain and each cell is connected to only one weight shift register. A weight matrix register per cell is configured to store a weight input received from a weight shift register. A multiply unit is coupled to the weight matrix register and configured to multiply the weight input of the weight matrix register with a vector data input in order to obtain a multiplication result.

    SHARED SCRATCHPAD MEMORY WITH PARALLEL LOAD-STORE

    公开(公告)号:US20240160909A1

    公开(公告)日:2024-05-16

    申请号:US18423203

    申请日:2024-01-25

    Applicant: Google LLC

    Abstract: Methods, systems, and apparatus, including computer-readable media, are described for a hardware circuit configured to implement a neural network. The circuit includes a first memory, respective first and second processor cores, and a shared memory. The first memory provides data for performing computations to generate an output for a neural network layer. Each of the first and second cores include a vector memory for storing vector values derived from the data provided by the first memory. The shared memory is disposed generally intermediate the first memory and at least one core and includes: i) a direct memory access (DMA) data path configured to route data between the shared memory and the respective vector memories of the first and second cores and ii) a load-store data path configured to route data between the shared memory and respective vector registers of the first and second cores.

    Shared scratchpad memory with parallel load-store

    公开(公告)号:US11922292B2

    公开(公告)日:2024-03-05

    申请号:US15931970

    申请日:2020-05-14

    Applicant: Google LLC

    Abstract: Methods, systems, and apparatus, including computer-readable media, are described for a hardware circuit configured to implement a neural network. The circuit includes a first memory, respective first and second processor cores, and a shared memory. The first memory provides data for performing computations to generate an output for a neural network layer. Each of the first and second cores include a vector memory for storing vector values derived from the data provided by the first memory. The shared memory is disposed generally intermediate the first memory and at least one core and includes: i) a direct memory access (DMA) data path configured to route data between the shared memory and the respective vector memories of the first and second cores and ii) a load-store data path configured to route data between the shared memory and respective vector registers of the first and second cores.

    Low latency matrix multiply unit
    68.
    发明授权

    公开(公告)号:US11907330B2

    公开(公告)日:2024-02-20

    申请号:US18111468

    申请日:2023-02-17

    Applicant: Google LLC

    Abstract: Methods, systems, and apparatus for a matrix multiply unit implemented as a systolic array of cells are disclosed. Each cell of the matrix multiply includes: a weight matrix register configured to receive a weight input from either a transposed or a non-transposed weight shift register; a transposed weight shift register configured to receive a weight input from a horizontal direction to be stored in the weight matrix register; a non-transposed weight shift register configured to receive a weight input from a vertical direction to be stored in the weight matrix register; and a multiply unit that is coupled to the weight matrix register and configured to multiply the weight input of the weight matrix register with a vector data input in order to obtain a multiplication result.

Patent Agency Ranking