Matrix transpose hardware acceleration

    Publication number: US12141468B1

    Publication date: 2024-11-12

    Application number: US17875805

    Filing date: 2022-07-28

    Abstract: In one example, an apparatus comprises: a memory array having an array of memory elements arranged in rows and columns, each memory element being configured to store a data element; and a memory access circuit configured to: perform a row write operation to store a first group of data elements at a first row of the array of memory elements; perform a column read operation at a first column of the array of memory elements to obtain a second group of data elements; and perform a column write operation to store a third group of data elements at the first column of the array of memory elements to replace the second group of data elements.
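The access pattern in the abstract — write a tile row by row, then read it out column by column while immediately overwriting each freed column with the next tile — can be sketched in plain Python. The class and method names below are my own; this is a behavioral model, not the patented circuit.

```python
# Behavioral sketch of the transpose memory: row writes fill the array,
# column reads produce transposed data, and each column just read is
# reused (overwritten) for incoming data.

class TransposeMemory:
    def __init__(self, n):
        self.n = n
        self.cells = [[0] * n for _ in range(n)]

    def row_write(self, r, data):
        """Store one group of data elements at row r."""
        self.cells[r] = list(data)

    def column_read(self, c):
        """Read column c, yielding elements in transposed order."""
        return [self.cells[r][c] for r in range(self.n)]

    def column_write(self, c, data):
        """Overwrite column c, replacing the group just read."""
        for r in range(self.n):
            self.cells[r][c] = data[r]

mem = TransposeMemory(3)
tile = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
for r, row in enumerate(tile):
    mem.row_write(r, row)

transposed = []
for c in range(3):
    transposed.append(mem.column_read(c))
    mem.column_write(c, [0, 0, 0])  # stand-in for the next tile's data

print(transposed)  # [[1, 4, 7], [2, 5, 8], [3, 6, 9]]
```

Interleaving the column write with the column read is what lets the same array buffer a new tile while the previous one drains out transposed.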

    Using shared data bus to support systolic array tiling

    Publication number: US11625453B1

    Publication date: 2023-04-11

    Application number: US16712699

    Filing date: 2019-12-12

    Abstract: To improve utilization of a systolic array, each row of the array is provided with a number of general purpose row input data buses. Each of the general purpose row input data buses can be operable to transfer either feature map (FMAP) input elements or weight values into the processing elements of the corresponding row of the array. By using such general purpose row input data buses, concurrent matrix multiplications as well as faster background weight loading can be achieved in the array.
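A minimal sketch of the general-purpose row bus idea, with all names invented for illustration: each row has two buses, and each bus carries a tagged value that is routed either to the weight-staging register (background weight load) or to the FMAP input of the row's processing elements.

```python
# Hypothetical model of two general-purpose row input buses feeding one
# row of processing elements. A bus cycle is a (kind, value) pair, where
# kind is 'fmap' or 'weight'; the row routes each accordingly.

def drive_row(bus_a, bus_b, pe_row):
    for kind, value in (bus_a, bus_b):
        if kind == 'weight':
            pe_row['staged_weight'] = value   # background weight load
        else:
            pe_row['fmap_in'] = value         # operand for current matmul

pe_row = {'staged_weight': None, 'fmap_in': None}

# Same cycle: bus A streams an FMAP element for the matmul in flight,
# while bus B loads a weight for the next matmul in the background.
drive_row(('fmap', 3.0), ('weight', 0.5), pe_row)
print(pe_row)  # {'staged_weight': 0.5, 'fmap_in': 3.0}
```

Because either bus can carry either kind of value, two independent tiles can also stream FMAP data concurrently, which is the utilization win the abstract describes.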

    Systolic array with efficient input reduction and extended array performance

    Publication number: US20230004384A1

    Publication date: 2023-01-05

    Application number: US17363894

    Filing date: 2021-06-30

    Abstract: Systems and methods are provided to perform multiply-accumulate operations of reduced precision numbers in a systolic array. Each row of the systolic array can receive reduced inputs from a respective reducer. The reduced input can include a reduced input data element and/or a reduced weight. The systolic array may lack support for inputs with a first bit-length and the reducers may reduce the bit-length of a given input from the first bit-length to a second shorter bit-length and provide the reduced input to the array. In order to reduce the bit-length, the reducer may reduce the number of trailing bits of the input. Further, the systolic array can receive a reduced and rounded input. The systolic array can propagate the reduced input through the processing elements in the systolic array. Each processing element may include a multiplier and/or an adder to perform arithmetical operations based on the reduced input.
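Dropping trailing mantissa bits of an FP32 value, with rounding, is one way to emulate the "reduced and rounded input" the abstract describes. The sketch below is an assumption about the reduction (the patent does not specify this exact scheme): it keeps 7 mantissa bits, bfloat16-style, and rounds half-up rather than ties-to-even.

```python
import struct

# Sketch: reduce an FP32 input's bit-length by truncating trailing
# mantissa bits, rounding to nearest (half-up) on the dropped bits.
# keep_mantissa_bits=7 mimics a bfloat16-like mantissa width.

def reduce_trailing_bits(x, keep_mantissa_bits=7):
    bits = struct.unpack('>I', struct.pack('>f', x))[0]
    drop = 23 - keep_mantissa_bits           # FP32 has a 23-bit mantissa
    round_bit = 1 << (drop - 1)
    bits = (bits + round_bit) & ~((1 << drop) - 1)  # round, then clear
    return struct.unpack('>f', struct.pack('>I', bits))[0]

print(reduce_trailing_bits(3.14159265))  # 3.140625
```

The carry from the addition propagates correctly into the exponent when rounding overflows the kept mantissa, which is why the round can be done directly on the raw bit pattern.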

    Configuration of a deep vector engine using an opcode table, control table, and datapath table

    Publication number: US12271732B1

    Publication date: 2025-04-08

    Application number: US17937333

    Filing date: 2022-09-30

    Abstract: A technique to program a compute channel having multiple computational circuit blocks coupled in series in a pipeline can include receiving a machine instruction for the compute channel. The machine instruction is decoded to obtain an opcode, and the opcode can be used as an index to access an opcode entry in an opcode table. The opcode entry contains a pointer to a microoperation, and the pointer can be used to access a microoperation represented by a control entry in a control table and a datapath configuration entry in a datapath table. The microoperation can then be issued to the compute channel by configuring the compute channel with the control entry and the datapath configuration entry.
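The three-table decode flow maps naturally onto dictionaries. Everything in the tables below is invented for illustration; only the flow itself (opcode → opcode-table entry → pointer → control entry + datapath entry) comes from the abstract.

```python
# Hypothetical table contents; the indirection structure is what matters.
CONTROL_TABLE  = {0: {'stages_enabled': [True, True, False]}}
DATAPATH_TABLE = {0: {'alu_op': 'add', 'bypass': False}}
OPCODE_TABLE   = {0x2A: {'uop_ptr': 0}}   # opcode -> pointer to microop

def decode_and_issue(instruction):
    opcode = instruction >> 8              # assumed encoding: opcode in high byte
    entry = OPCODE_TABLE[opcode]           # opcode indexes the opcode table
    ptr = entry['uop_ptr']                 # entry holds a microop pointer
    # The microoperation is the pair (control entry, datapath entry);
    # issuing it means configuring the compute channel with both.
    return CONTROL_TABLE[ptr], DATAPATH_TABLE[ptr]

ctrl, dp = decode_and_issue(0x2A00)
print(ctrl, dp)
```

One design consequence of the indirection: multiple opcodes can share a microoperation by pointing at the same control/datapath entries, keeping the tables compact.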

    Increasing performance of computational array accelerators

    Publication number: US12182691B1

    Publication date: 2024-12-31

    Application number: US17249900

    Filing date: 2021-03-17

    Abstract: To improve performance of a computational array, the architecture of the array can be modified to allow the processing engines of a column to operate in parallel and the clock frequency of the array to be increased. The processing engines of each column of the array can be grouped into a series of row groups. The processing engines of each row group can be loaded with input values, and computations on the input values can be carried out in parallel to generate the column output. One or more flip-flop stages can be inserted into the computational logic of each of the processing engines. The computational logic can then be distributed across the flip-flop stages to reduce the propagation delay between flip-flop stages of the processing engine, hence allowing the clock frequency of the array to be increased.
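Functionally, grouping a column's processing engines into row groups that compute in parallel and then combining their partial sums is just a blocked dot product. The sketch below models only the arithmetic, not the flip-flop retiming; the group size is an assumption.

```python
# Functional model: the column's multiply-accumulates are split into
# row groups (each group computed in parallel in hardware), and the
# group partial sums are combined into the column output.

def column_output(inputs, weights, group_size=4):
    groups = [
        sum(x * w for x, w in zip(inputs[i:i + group_size],
                                  weights[i:i + group_size]))
        for i in range(0, len(inputs), group_size)
    ]
    return sum(groups)  # combine partial sums across row groups

print(column_output([1, 2, 3, 4, 5, 6, 7, 8], [1] * 8))  # 36
```

Because each row group's partial sum no longer depends on the group above it, the critical path per clock shrinks to one group plus the combine, which is what permits the higher clock frequency.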

    Emulating fine-grained sparsity in a systolic array

    Publication number: US12130885B1

    Publication date: 2024-10-29

    Application number: US18052527

    Filing date: 2022-11-03

    CPC classification number: G06F17/16 G06N3/04

    Abstract: To take advantage of the architecture of a systolic array tailored to perform sparse matrix multiplications, a weight matrix can be converted into a set of constrained fine-grained sparse weight matrices. The conversion process may include receiving a request to perform a matrix multiplication operation with a weight matrix, and determining that the weight matrix satisfies a sparsity condition to convert the weight matrix into a set of constrained fine-grained sparse weight matrices. The weight matrix can then be converted into a set of constrained fine-grained sparse weight matrices. Computer instructions can then be generated for an integrated circuit device to perform the requested matrix multiplication operation as a set of sparse matrix multiplication operations using the set of constrained fine-grained sparse weight matrices.
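One common form of such a conversion, shown here as an assumed example (the patent does not specify this exact constraint): a row satisfying a 2-nonzeros-per-4-element sparsity condition is split into two constrained rows with at most one nonzero per block, so the dense multiply becomes a sum of sparser multiplies.

```python
# Illustrative decomposition: a 2:4-sparse weight row is converted into
# two constrained fine-grained sparse rows (at most one nonzero per
# 4-element block each), so W @ x == W1 @ x + W2 @ x.

def split_2of4(row, block=4):
    a, b = [0] * len(row), [0] * len(row)
    for start in range(0, len(row), block):
        nz = [i for i in range(start, start + block) if row[i] != 0]
        assert len(nz) <= 2, "row violates the 2:4 sparsity condition"
        for dst, i in zip((a, b), nz):  # scatter nonzeros across the two rows
            dst[i] = row[i]
    return a, b

w = [5, 0, 0, -2, 0, 7, 0, 0]
w1, w2 = split_2of4(w)
print(w1)  # [5, 0, 0, 0, 0, 7, 0, 0]
print(w2)  # [0, 0, 0, -2, 0, 0, 0, 0]
```

The sparsity-condition check mirrors the abstract's "determining that the weight matrix satisfies a sparsity condition" step before conversion.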

    Resizable scratchpad memory
    Granted invention patent

    Publication number: US12045475B1

    Publication date: 2024-07-23

    Application number: US17457502

    Filing date: 2021-12-03

    Abstract: Techniques for implementing a dynamically resizable memory region for alternative use in a memory are described. The techniques may include using two concurrent address maps corresponding to two address ranges for a memory represented as an array of memory blocks. The first address range can be mapped to the memory with starting addresses of the memory blocks incrementing sequentially along each row. The second address range can be mapped to the memory with starting addresses of the memory blocks incrementing sequentially along each column. When an access request is received having a target address belonging to the first address range, the target address is provided as the memory address to access the memory. When an access request is received having a target address belonging to the second address range, the target address is translated by address translation logic into a memory address to access the memory.
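The two concurrent maps can be sketched with an assumed block geometry (the 2×4 array and 16-byte blocks below are invented): addresses in the first range pass through unchanged (row-major block order), while addresses in the second range walk the same blocks column-major and are translated into row-major memory addresses.

```python
# Assumed geometry: a 2x4 array of 16-byte memory blocks.
ROWS, COLS, BLOCK = 2, 4, 16
RANGE0 = ROWS * COLS * BLOCK          # second address range starts here

def to_memory_address(addr):
    if addr < RANGE0:
        return addr                   # first map: identity (row-major)
    off = addr - RANGE0               # second map: column-major walk
    blk, byte = divmod(off, BLOCK)
    col, row = divmod(blk, ROWS)      # block starts increment down columns
    return (row * COLS + col) * BLOCK + byte

print(to_memory_address(5))                # 5  (first range, unchanged)
print(to_memory_address(RANGE0 + BLOCK))   # 64 (row 1, col 0 in memory)
```

Since both ranges resolve to the same physical blocks, growing or shrinking the alternative-use region is purely a matter of how much of each address range software hands out, with no data movement.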
