Patent search ap:("Amazon Technologies Page Inc.") AND inv:"Paul Gilbert Meyer"

1.

Patent application
MIXING SPARSITY COMPRESSION (Active)

Publication number: US20230100930A1

Publication date: 2023-03-30

Application number: US17449576

Application date: 2021-09-30

Applicant: Amazon Technologies, Inc.

Inventor: Xiaodan Tan, Paul Gilbert Meyer, Gennady Pekhimenko, Randy Renfu Huang

IPC: G06N3/08, G06N3/04

Abstract: Techniques for compressing a neural network model by mixing compression ratios (sparsity patterns) are described. The weight tensor of a neural network model is divided into weight groups. The pruning cost of compressing the weight values according to a compression ratio is determined for each weight group, and a pruning cost distribution for the compression ratio is generated from the pruning costs of the weight groups. A cost threshold can then be selected from the pruning cost distribution, and weight groups having a pruning cost below the selected cost threshold are compressed according to the compression ratio. The remaining weight groups can be compressed using one or more less aggressive compression ratios. The cost threshold can be adjusted to tune the overall sparsity and accuracy of the compressed neural network.
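The flow in this abstract can be illustrated with a minimal sketch. It assumes a flat list of weights, a pruning cost equal to the summed magnitude of the weights that would be zeroed, and a threshold picked as a quantile of the cost distribution. None of these choices are stated in the abstract, and the names `prune_group`, `mix_sparsity`, `aggressive_keep`, `mild_keep` and `quantile` are hypothetical.

```python
def prune_group(group, keep):
    """Zero all but the `keep` largest-magnitude weights.

    Returns (pruned_group, cost). The cost is an assumed metric:
    the summed magnitude of the weights that were zeroed.
    """
    order = sorted(range(len(group)), key=lambda i: abs(group[i]), reverse=True)
    kept = set(order[:keep])
    pruned = [w if i in kept else 0.0 for i, w in enumerate(group)]
    cost = sum(abs(w) for i, w in enumerate(group) if i not in kept)
    return pruned, cost


def mix_sparsity(weights, group_size=4, aggressive_keep=1, mild_keep=2, quantile=0.5):
    """Prune cheap groups aggressively and the remaining groups mildly."""
    groups = [weights[i:i + group_size] for i in range(0, len(weights), group_size)]
    # Pruning cost of every group under the aggressive ratio.
    costs = [prune_group(g, aggressive_keep)[1] for g in groups]
    # Select a threshold from the cost distribution (assumed: a quantile).
    ranked = sorted(costs)
    threshold = ranked[int(quantile * (len(ranked) - 1))]
    out = []
    for group, cost in zip(groups, costs):
        keep = aggressive_keep if cost < threshold else mild_keep
        out.extend(prune_group(group, keep)[0])
    return out
```

Raising `quantile` moves more groups under the aggressive ratio, which increases overall sparsity at the expense of a larger total pruning cost.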

2.

Granted patent
Emulating fine-grained sparsity in a systolic array (Active)

Publication number: US11500962B1

Publication date: 2022-11-15

Application number: US16917033

Application date: 2020-06-30

Applicant: Amazon Technologies, Inc.

Inventor: Paul Gilbert Meyer, Thiam Khean Hah, Randy Renfu Huang, Ron Diamant, Vignesh Vivekraja

IPC: G06F17/16, G06N3/04

Abstract: To take advantage of the architecture of a systolic array tailored to perform sparse matrix multiplications, a weight matrix can be converted into a set of constrained fine-grained sparse weight matrices. The conversion process may include receiving a request to perform a matrix multiplication operation with a weight matrix, and determining that the weight matrix satisfies a sparsity condition to convert the weight matrix into a set of constrained fine-grained sparse weight matrices. The weight matrix can then be converted into a set of constrained fine-grained sparse weight matrices. Computer instructions can then be generated for an integrated circuit device to perform the requested matrix multiplication operation as a set of sparse matrix multiplication operations using the set of constrained fine-grained sparse weight matrices.
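One way to picture the conversion is the sketch below. It assumes the constraint is "at most n non-zero values in every m consecutive elements of a row" (an n:m pattern), which the abstract does not specify. The function names are hypothetical. The dense matrix is split into constrained parts that sum back to the original, so the dense product equals the sum of the sparse products.

```python
def split_into_constrained(matrix, n=2, m=4):
    """Split `matrix` into parts with at most n non-zeros per m consecutive
    row elements. The parts sum element-wise to the original matrix."""
    remaining = [list(row) for row in matrix]  # work on a copy
    parts = []
    while any(v != 0 for row in remaining for v in row):
        part = [[0] * len(row) for row in remaining]
        for r, row in enumerate(remaining):
            for start in range(0, len(row), m):
                taken = 0
                for c in range(start, min(start + m, len(row))):
                    if row[c] != 0 and taken < n:
                        part[r][c] = row[c]  # move the value into this part
                        row[c] = 0
                        taken += 1
        parts.append(part)
    return parts


def is_constrained(matrix, n=2, m=4):
    """Check the assumed n:m constraint on every row block."""
    return all(
        sum(1 for v in row[s:s + m] if v != 0) <= n
        for row in matrix
        for s in range(0, len(row), m)
    )


def matvec(matrix, vector):
    """Plain matrix-vector product, used to compare dense and split results."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]
```

A compiler following this scheme would likely convert only when the number of resulting parts is small enough for the sparse path to be cheaper than the dense one. That decision corresponds to the sparsity condition mentioned in the abstract and is not modeled here.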

3.

Granted patent
Matrix transpose hardware acceleration (Active)

Publication number: US11435941B1

Publication date: 2022-09-06

Application number: US16911127

Application date: 2020-06-24

Applicant: Amazon Technologies, Inc.

Inventor: Kun Xu, Paul Gilbert Meyer, Ron Diamant

IPC: G06F3/06, G06N5/04, G06N3/02

Abstract: In one example, an apparatus comprises: a memory array having an array of memory elements arranged in rows and columns, each memory element being configured to store a data element; and a memory access circuit configured to: perform a row write operation to store a first group of data elements at a first row of the array of memory elements; perform a column read operation at a first column of the array of memory elements to obtain a second group of data elements; and perform a column write operation to store a third group of data elements at the first column of the array of memory elements to replace the second group of data elements.
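The three access types named in the abstract can be modeled in software as follows. This is a behavioral sketch only, assuming a square array. The class `TransposeBuffer` and its method names are hypothetical, not taken from the patent.

```python
class TransposeBuffer:
    """Square memory array that can be written by row and read or written by column."""

    def __init__(self, size):
        self.size = size
        self.cells = [[None] * size for _ in range(size)]

    def write_row(self, r, data):
        if len(data) != self.size:
            raise ValueError("row length must match array size")
        self.cells[r] = list(data)

    def read_column(self, c):
        return [self.cells[r][c] for r in range(self.size)]

    def write_column(self, c, data):
        if len(data) != self.size:
            raise ValueError("column length must match array size")
        for r, value in enumerate(data):
            self.cells[r][c] = value


def transpose(matrix):
    """Transpose a square matrix by writing rows and reading columns."""
    buf = TransposeBuffer(len(matrix))
    for r, row in enumerate(matrix):
        buf.write_row(r, row)
    return [buf.read_column(c) for c in range(buf.size)]
```

The column write in the abstract replaces a column that has just been read, which suggests that new data can be loaded into freed storage while older data is still being drained. The sketch exposes `write_column` but does not model that overlap.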

4.

Patent application
SYSTOLIC ARRAY WITH INPUT REDUCTION TO MULTIPLE REDUCED INPUTS (Active)

Publication number: US20230004523A1

Publication date: 2023-01-05

Application number: US17363900

Application date: 2021-06-30

Applicant: Amazon Technologies, Inc.

Inventor: Paul Gilbert Meyer, Thomas A. Volpe, Ron Diamant, Joshua Wayne Bowman, Nishith Desai, Thomas Elmer

IPC: G06F15/80, G06F7/487, G06F7/499, G06F7/501

Abstract: Systems and methods are provided to perform multiply-accumulate operations of reduced precision numbers in a systolic array. Each row of the systolic array can receive reduced inputs from a respective reducer. The reducer can receive a particular input and generate multiple reduced inputs from the input. The reduced inputs can include reduced input data elements and/or reduced weights. The systolic array may lack support for inputs with a first bit-length and the reducers may reduce the bit-length of a given input from the first bit-length to a second, shorter bit-length and provide multiple reduced inputs with the second, shorter bit-length to the array. The systolic array may perform multiply-accumulate operations on each unique combination of the multiple reduced input data elements and the reduced weights to generate multiple partial outputs. The systolic array may sum the partial outputs to generate the output.
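The arithmetic behind this idea can be shown with an unsigned-integer analogue. The number format is an assumption made for illustration, since the abstract does not state one. A 16-bit input is reduced to two 8-bit inputs, every combination of reduced data and reduced weight is multiplied, and the partial outputs are summed. The names `reduce_input` and `multiply_reduced` are hypothetical.

```python
def reduce_input(value, bits=8):
    """Split a non-negative integer of up to 2*bits bits into (high, low) parts."""
    if value < 0 or value >> (2 * bits):
        raise ValueError("value does not fit in 2*bits unsigned bits")
    mask = (1 << bits) - 1
    return value >> bits, value & mask


def multiply_reduced(x, w, bits=8):
    """Multiply x by w using only products of `bits`-wide reduced inputs."""
    x_parts = reduce_input(x, bits)  # (high, low)
    w_parts = reduce_input(w, bits)
    total = 0
    for i, xp in enumerate(x_parts):
        for j, wp in enumerate(w_parts):
            # A high part carries a weight of 2**bits, a low part a weight of 1.
            shift = bits * ((1 - i) + (1 - j))
            total += (xp * wp) << shift  # one partial output
    return total
```

Two reduced inputs per operand give four partial products, so the narrower multipliers are used four times per full-width multiplication in this sketch.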

5.

Granted patent
Throughput increase for compute engine (Active)

Publication number: US12260214B1

Publication date: 2025-03-25

Application number: US17937332

Application date: 2022-09-30

Applicant: Amazon Technologies, Inc.

Inventor: Paul Gilbert Meyer, Ron Diamant, Sundeep Amirineni, Sunil Kumar Bathula

IPC: G06F9/38, G06F9/30, G06F13/16

Abstract: A compute channel can have multiple computational circuit blocks coupled in series to form a pipeline. The compute channel can perform a computation on an input tensor to generate an output tensor based on an instruction. When the computation does not require all of the computational circuit blocks, the throughput of the compute channel can be increased by splitting the data elements of the input tensor into multiple input data streams. The multiple input data streams are provided to respective subsets of one or more computational circuit blocks in the pipeline using bypass circuitry of the computational circuit blocks, and the computation can be performed on multiple input data streams in the respective subsets of one or more computational circuit blocks to generate multiple output data streams corresponding to the output tensor.
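A rough software model of the stream-splitting idea follows. It assumes each circuit block applies one unary operation, that data is split round-robin, and that one element enters each stream per pass. All of these are simplifications chosen for illustration, and `run_channel` is a hypothetical name.

```python
def run_channel(elements, ops, num_blocks):
    """Model a pipeline of `num_blocks` blocks executing `ops` on each element.

    When `ops` needs fewer blocks than the pipeline has, the input is split
    into several streams, each served by its own subset of blocks.
    Returns (outputs in input order, number of passes needed).
    """
    if not ops or len(ops) > num_blocks:
        raise ValueError("ops must use between 1 and num_blocks blocks")
    streams = num_blocks // len(ops)  # disjoint block subsets available
    split = [elements[s::streams] for s in range(streams)]
    merged = [None] * len(elements)
    for s, stream in enumerate(split):
        out = []
        for x in stream:
            for op in ops:  # blocks of this subset, in series
                x = op(x)
            out.append(x)
        merged[s::streams] = out  # restore the original element order
    passes = max(len(stream) for stream in split)
    return merged, passes
```

With four blocks and a one-block computation, five elements finish in two passes instead of five in this model. The outputs are the same as when every element flows through a single stream.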

6.

Granted patent
Throughput increase for tensor operations (Active)

Publication number: US12099840B1

Publication date: 2024-09-24

Application number: US18185236

Application date: 2023-03-16

Applicant: Amazon Technologies, Inc.

Inventor: Xiaodan Tan, Paul Gilbert Meyer, Ron Diamant

IPC: G06F9/30

CPC classification number: G06F9/30018, G06F9/30032

Abstract: A technique for performing a tensor operation includes inputting concatenated data words of a first input tensor and concatenated data words of a second input tensor into a compute channel having a plurality of compute stages coupled in series. The concatenated data words of the first input tensor and the second input tensor represented in a first datatype can be converted into data elements represented in a second datatype using a first subset of the compute stages. A binary operation can be performed on each data element represented in the second datatype from the first input tensor with a corresponding data element represented in the second datatype from the second input tensor to generate output data elements of an output tensor represented in the second datatype using a second subset of the compute stages. The output data elements of the output tensor can then be outputted from the compute channel.
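The two phases in this abstract (datatype conversion, then a binary operation) can be sketched as below. The sketch assumes the first datatype is a signed 8-bit integer with two values concatenated per 16-bit word, and that the second datatype is a plain Python integer. The abstract names neither type, and the function names are hypothetical.

```python
def pack_word(values, bits=8):
    """Concatenate signed `bits`-wide values into one word, lowest lane first."""
    mask = (1 << bits) - 1
    word = 0
    for lane, value in enumerate(values):
        word |= (value & mask) << (lane * bits)
    return word


def unpack_word(word, lanes=2, bits=8):
    """First stage: split a concatenated word and widen each signed lane."""
    mask = (1 << bits) - 1
    sign = 1 << (bits - 1)
    values = []
    for lane in range(lanes):
        raw = (word >> (lane * bits)) & mask
        values.append(raw - (1 << bits) if raw & sign else raw)
    return values


def tensor_binary_op(words_a, words_b, op, lanes=2, bits=8):
    """Second stage: apply `op` to corresponding widened elements."""
    out = []
    for word_a, word_b in zip(words_a, words_b):
        a = unpack_word(word_a, lanes, bits)
        b = unpack_word(word_b, lanes, bits)
        out.extend(op(x, y) for x, y in zip(a, b))
    return out
```

Because each word carries two elements, half as many words enter the channel as there are output elements, which is where a throughput gain would come from under these assumptions.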

7.

Granted patent
Programmable vector engine for efficient beam search (Active)

Publication number: US12039330B1

Publication date: 2024-07-16

Application number: US17447677

Application date: 2021-09-14

Applicant: Amazon Technologies, Inc.

Inventor: Paul Gilbert Meyer

IPC: G06F9/30, G06N3/02

CPC classification number: G06F9/3001, G06N3/02

Abstract: To perform a beam search operation on an input tensor using a data processor with native hardware support, the data processor can be programmed with a set of instructions. The set of instructions can include a first machine instruction that operates on the input tensor to obtain N largest values in the input tensor, a second machine instruction that operates on the input tensor to obtain indices corresponding to the N largest values in the input tensor, and a third machine instruction that operates on the input tensor to replace the N largest values in the input tensor with a minimum value.
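The three instructions in this abstract can be emulated in software to show how they might compose into a top-k selection, the core step of beam search. The function names are hypothetical stand-ins for the machine instructions. Using negative infinity as the minimum value and looping to build a top-k larger than N are assumptions.

```python
def find_max_values(tensor, n):
    """First instruction: the n largest values, largest first."""
    return sorted(tensor, reverse=True)[:n]


def find_max_indices(tensor, n):
    """Second instruction: indices of the n largest values, in the same order."""
    order = sorted(range(len(tensor)), key=lambda i: tensor[i], reverse=True)
    return order[:n]


def replace_max(tensor, n, minimum=float("-inf")):
    """Third instruction: overwrite the n largest values with a minimum value."""
    hit = set(find_max_indices(tensor, n))
    return [minimum if i in hit else v for i, v in enumerate(tensor)]


def top_k(tensor, k, n=2):
    """Select the k best candidates by repeating the three instructions."""
    k = min(k, len(tensor))
    work = list(tensor)
    values, indices = [], []
    for _ in range(-(-k // n)):  # ceil(k / n) rounds
        values += find_max_values(work, n)
        indices += find_max_indices(work, n)
        work = replace_max(work, n)  # mask out what was already taken
    return values[:k], indices[:k]
```

Replacing the selected values with the minimum is what lets the same two "find" instructions return the next-best candidates on the following round.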

8.

Published application
COMPUTE ENGINE WITH TRANSPOSE CIRCUITRY (Under examination, published)

Publication number: US20240103813A1

Publication date: 2024-03-28

Application number: US17934145

Application date: 2022-09-21

Applicant: Amazon Technologies, Inc.

Inventor: Xiaodan Tan, Paul Gilbert Meyer, Sheng Xu, Ron Diamant

IPC: G06F7/76, G06F7/57, G06F17/16

CPC classification number: G06F7/768, G06F7/57, G06F17/16

Abstract: An integrated circuit that combines transpose and compute operations may include a transpose circuit coupled to a set of compute channels. Each compute channel may include multiple arithmetic logic unit (ALU) circuits coupled in series. The transpose circuit is operable to receive an input tensor, transpose the input tensor, and output a transposed tensor to the set of compute channels. The set of compute channels is operable to generate outputs in parallel, with each of the outputs being generated from a corresponding vector of the transposed tensor.
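A behavioral sketch of the transpose-then-compute arrangement follows. It assumes each ALU applies one element-wise operation and that every compute channel receives one vector of the transposed tensor. The channels run sequentially here, whereas the abstract describes them operating in parallel. All names are hypothetical.

```python
def transpose(tensor):
    """Transpose circuit: rows of the input become columns of the output."""
    return [list(column) for column in zip(*tensor)]


def compute_channel(vector, alu_ops):
    """One channel: ALUs coupled in series, each applying its operation in turn."""
    out = []
    for x in vector:
        for op in alu_ops:
            x = op(x)
        out.append(x)
    return out


def transpose_and_compute(tensor, alu_ops):
    """Feed each vector of the transposed tensor to its own compute channel."""
    return [compute_channel(vector, alu_ops) for vector in transpose(tensor)]
```

For example, `transpose_and_compute([[1, 2, 3], [4, 5, 6]], [lambda v: v * 2, lambda v: v + 1])` processes three vectors of two elements each, one per channel.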

9.

Granted patent
Increasing performance of computational array accelerators (Active)

Publication number: US12182691B1

Publication date: 2024-12-31

Application number: US17249900

Application date: 2021-03-17

Applicant: Amazon Technologies, Inc.

Inventor: Sundeep Amirineni, Akshay Balasubramanian, Joshua Wayne Bowman, Ron Diamant, Paul Gilbert Meyer, Thomas Elmer

IPC: G06N3/063, G06F7/544, G06F9/30

Abstract: To improve performance of a computational array, the architecture of the array can be modified to allow the processing engines of a column to operate in parallel and the clock frequency of the array to be increased. The processing engines of each column of the array can be grouped into a series of row groups. The processing engines of each row group can be loaded with input values, and computations on the input values can be carried out in parallel to generate the column output. One or more flip-flop stages can be inserted into the computational logic of each of the processing engines. The computational logic can then be distributed across the flip-flop stages to reduce the propagation delay between flip-flop stages of the processing engine, hence allowing the clock frequency of the array to be increased.

10.

Granted patent
Emulating fine-grained sparsity in a systolic array (Active)

Publication number: US12130885B1

Publication date: 2024-10-29

Application number: US18052527

Application date: 2022-11-03

Applicant: Amazon Technologies, Inc.

Inventor: Paul Gilbert Meyer, Thiam Khean Hah, Randy Renfu Huang, Ron Diamant, Vignesh Vivekraja

IPC: G06F17/16, G06N3/04

CPC classification number: G06F17/16, G06N3/04

Abstract: To take advantage of the architecture of a systolic array tailored to perform sparse matrix multiplications, a weight matrix can be converted into a set of constrained fine-grained sparse weight matrices. The conversion process may include receiving a request to perform a matrix multiplication operation with a weight matrix, and determining that the weight matrix satisfies a sparsity condition to convert the weight matrix into a set of constrained fine-grained sparse weight matrices. The weight matrix can then be converted into a set of constrained fine-grained sparse weight matrices. Computer instructions can then be generated for an integrated circuit device to perform the requested matrix multiplication operation as a set of sparse matrix multiplication operations using the set of constrained fine-grained sparse weight matrices.
