-
Publication Number: US11809953B1
Publication Date: 2023-11-07
Application Number: US17902702
Application Date: 2022-09-02
Applicant: Amazon Technologies, Inc.
Inventor: Samuel Jacob , Ilya Minkin , Mohammad El-Shabani
Abstract: Embodiments include techniques for enabling execution of N inferences on an execution engine of a neural network device. Instruction code for a single inference is stored in a memory that is accessible by a DMA engine, the instruction code forming a regular code block. A NOP code block and a reset code block for resetting an instruction DMA queue are stored in the memory. The instruction DMA queue is generated such that, when it is executed by the DMA engine, it causes the DMA engine to copy, for each of N inferences, both the regular code block and an additional code block to an instruction buffer. The additional code block is the NOP code block for the first N−1 inferences and is the reset code block for the Nth inference. When the reset code block is executed by the execution engine, the instruction DMA queue is reset.
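The abstract describes building one DMA queue that replays a fixed instruction block N times, switching only the trailing block on the final pass. A minimal sketch of that queue construction in C; all names (dma_copy_t, build_instruction_queue) are hypothetical and not taken from the patent:

```c
#include <stddef.h>

/* Hypothetical DMA copy descriptor: copy `len` bytes from `src`
 * (host memory) into the execution engine's instruction buffer. */
typedef struct {
    const void *src;
    size_t      len;
} dma_copy_t;

/* Build the instruction DMA queue for N inferences. Each inference
 * copies the regular code block plus one additional block: a NOP
 * block for the first N-1 inferences, and a reset block (which
 * resets this very queue) for the Nth. */
size_t build_instruction_queue(dma_copy_t *queue,
                               const void *regular, size_t regular_len,
                               const void *nop,     size_t nop_len,
                               const void *reset,   size_t reset_len,
                               int n_inferences)
{
    size_t e = 0;
    for (int i = 0; i < n_inferences; i++) {
        queue[e++] = (dma_copy_t){ regular, regular_len };
        if (i < n_inferences - 1)
            queue[e++] = (dma_copy_t){ nop, nop_len };     /* padding only */
        else
            queue[e++] = (dma_copy_t){ reset, reset_len }; /* re-arms queue */
    }
    return e; /* number of queue entries written */
}
```

Because the Nth entry copies the reset block, executing it re-arms the instruction DMA queue for the next batch of N inferences without host intervention.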
-
Publication Number: US11550736B1
Publication Date: 2023-01-10
Application Number: US17449581
Application Date: 2021-09-30
Applicant: Amazon Technologies, Inc.
Inventor: Kun Xu , Ron Diamant , Ilya Minkin , Mohammad El-Shabani , Raymond S. Whiteside , Uday Shilton Udayaselvam
Abstract: To reduce direct memory access (DMA) overhead, a tensorized descriptor can be used to generate a series of memory descriptors for a series of DMA data transfers. The tensorized descriptor may include attributes such as a stride and a memory descriptor template, from which the series of memory descriptors is generated. Hence, instead of retrieving each memory descriptor individually, a single tensorized descriptor can be retrieved to perform the entire series of data transfers.
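As a rough illustration, a tensorized descriptor can be expanded on the fly into the plain memory descriptors it stands for. A hedged C sketch; the struct and field names are hypothetical, not from the patent:

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical plain memory descriptor: one DMA transfer. */
typedef struct {
    uint64_t src;
    uint64_t dst;
    size_t   len;
} mem_desc_t;

/* Hypothetical tensorized descriptor: a template plus a stride and
 * a count, standing in for `count` individual memory descriptors. */
typedef struct {
    mem_desc_t tmpl;   /* base addresses and transfer length */
    uint64_t   stride; /* byte offset added per transfer */
    int        count;  /* number of transfers to generate */
} tensorized_desc_t;

/* Expand one tensorized descriptor into the series of memory
 * descriptors it represents, instead of fetching each from memory. */
void expand(const tensorized_desc_t *t, mem_desc_t *out)
{
    for (int i = 0; i < t->count; i++) {
        out[i] = t->tmpl;
        out[i].src += (uint64_t)i * t->stride;
        out[i].dst += (uint64_t)i * t->stride;
    }
}
```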
-
Publication Number: US11182314B1
Publication Date: 2021-11-23
Application Number: US16698761
Application Date: 2019-11-27
Applicant: Amazon Technologies, Inc.
Inventor: Drazen Borkovic , Ilya Minkin , Vignesh Vivekraja , Richard John Heaton , Randy Renfu Huang
Abstract: An integrated circuit device implementing a neural network accelerator may have a peripheral bus interface to interface with a host memory, and neural network models can be loaded from the host memory onto the state buffer of the neural network accelerator for execution by the array of processing elements. The neural network accelerator may also have a memory interface to interface with a local memory. The local memory may store neural network models from the host memory, and the models can be loaded from the local memory into the state buffer with reduced latency as compared to loading from the host memory. In systems with multiple accelerators, the models in the local memory can also be shared amongst different accelerators.
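A toy illustration of the load path the abstract describes, preferring the local copy when one exists; everything here (model_t, load_into_state_buffer) is a hypothetical stand-in for the actual hardware interfaces:

```c
#include <string.h>
#include <stddef.h>

/* Hypothetical model record: the same model may live in host memory
 * (reached over the peripheral bus) and in accelerator-local memory. */
typedef struct {
    const void *host_copy;   /* model in host memory */
    const void *local_copy;  /* cached copy in local memory, or NULL */
    size_t      size;
} model_t;

/* Copy the model into the accelerator's state buffer, using the
 * lower-latency local copy when present. In a multi-accelerator
 * system the local copy could be shared among accelerators. */
void load_into_state_buffer(const model_t *m, void *state_buffer)
{
    const void *src = m->local_copy ? m->local_copy : m->host_copy;
    memcpy(state_buffer, src, m->size);
}
```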
-
Publication Number: US11868872B1
Publication Date: 2024-01-09
Application Number: US16836493
Application Date: 2020-03-31
Applicant: Amazon Technologies, Inc.
Inventor: Ilya Minkin , Ron Diamant , Kun Xu
CPC classification number: G06N3/063 , G06F12/0292 , G06F12/1081 , G06F13/1605 , G06F13/28 , G06N3/045 , G11C15/04 , G06F2212/152 , G06F2213/2802
Abstract: In one example, an apparatus comprises: a direct memory access (DMA) descriptor queue that stores DMA descriptors, each DMA descriptor including an indirect address; an address translation table that stores an address mapping between indirect addresses and physical addresses; and a DMA engine configured to: fetch a DMA descriptor from the DMA descriptor queue, access the address translation table to translate a first indirect address of the DMA descriptor into a first physical address based on the address mapping, and perform a DMA operation by executing the DMA descriptor to transfer data to or from the first physical address.
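A small C sketch of the translation step, assuming a linear-scan table for clarity (real hardware would more plausibly use a content-addressable memory, consistent with the G11C15/04 classification above); all names are hypothetical:

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical address translation table entry: maps an indirect
 * address (a stable handle carried in DMA descriptors) to the
 * physical address it currently refers to. */
typedef struct {
    uint64_t indirect;
    uint64_t physical;
} xlat_entry_t;

/* Look up the physical address for an indirect address. Returns 0
 * on a miss; a real engine would raise a fault instead. The DMA
 * engine performs this lookup after fetching each descriptor and
 * before issuing the transfer. */
uint64_t translate(const xlat_entry_t *table, size_t n, uint64_t indirect)
{
    for (size_t i = 0; i < n; i++)
        if (table[i].indirect == indirect)
            return table[i].physical;
    return 0;
}
```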
-
Publication Number: US11531578B1
Publication Date: 2022-12-20
Application Number: US16216887
Application Date: 2018-12-11
Applicant: Amazon Technologies, Inc.
Inventor: Richard John Heaton , Ilya Minkin
Abstract: Remote access for debugging or profiling a remotely executing neural network graph can be performed by a client using an in-band application programming interface (API). The client can provide indicator flags for debugging or profiling in an inference request sent to a remote server computer executing the neural network graph using the API. The remote server computer can collect metadata for debugging or profiling during the inference operation using the neural network graph and send it back to the client using the same API. Additionally, the metadata can be collected at various granularity levels also specified in the inference request.
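One way to picture the in-band API: the inference request itself carries the indicator flags and the granularity level, and the collected metadata comes back over the same API with the result. A hypothetical C sketch of such request/response types (none of these names come from the patent):

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical indicator flags carried in the inference request. */
enum { REQ_DEBUG = 1u << 0, REQ_PROFILE = 1u << 1 };

/* Hypothetical granularity levels for metadata collection. */
enum { GRAN_GRAPH, GRAN_NODE, GRAN_INSTRUCTION };

/* The same API call that runs the neural network graph also carries
 * the debug/profile flags and the requested granularity. */
typedef struct {
    const void *input;
    size_t      input_len;
    uint32_t    flags;        /* REQ_DEBUG and/or REQ_PROFILE */
    uint32_t    granularity;  /* one of the GRAN_* levels */
} inference_request_t;

/* The response returns the inference result plus any metadata the
 * remote server collected during the inference. */
typedef struct {
    const void *output;
    const void *metadata;     /* NULL unless collection was requested */
    size_t      metadata_len;
} inference_response_t;
```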
-
Publication Number: US12204757B1
Publication Date: 2025-01-21
Application Number: US18067514
Application Date: 2022-12-16
Applicant: Amazon Technologies, Inc.
Inventor: Kun Xu , Ron Diamant , Ilya Minkin , Raymond S. Whiteside
IPC: G06F3/06
Abstract: A technique for processing strong ordered transactions in a direct memory access engine may include retrieving a memory descriptor to perform a strong ordered transaction, and delaying the strong ordered transaction until pending write transactions associated with previous memory descriptors retrieved prior to the memory descriptor are complete. Subsequent transactions associated with memory descriptors following the memory descriptor are allowed to be issued while waiting for the pending write transactions to complete. Upon completion of the pending write transactions, the strong ordered transaction is performed.
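A hedged sketch of the scheduling rule in C: a strong ordered descriptor is skipped (delayed) while writes from earlier descriptors are outstanding, but the scan continues so later descriptors can still issue. The descriptor fields and the writes_pending bookkeeping are hypothetical:

```c
#include <stdbool.h>

/* Hypothetical descriptor with a strong-ordering attribute. */
typedef struct {
    bool strong_ordered;
    bool issued;
} desc_t;

/* One scheduling pass. writes_pending[i] counts write transactions
 * still outstanding for descriptors earlier than i (bookkeeping a
 * real engine would track in hardware). A strong ordered descriptor
 * is delayed until that count reaches zero; descriptors after it
 * may still issue while it waits. */
void schedule_pass(desc_t *q, const int *writes_pending, int n)
{
    for (int i = 0; i < n; i++) {
        if (q[i].issued)
            continue;
        if (q[i].strong_ordered && writes_pending[i] > 0)
            continue;        /* hold back; keep scanning later entries */
        /* ...issue the DMA transaction for q[i] here... */
        q[i].issued = true;
    }
}
```

A real engine would repeat such a pass, or be event-driven on write completions, until every descriptor has issued.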
-
Publication Number: US11175919B1
Publication Date: 2021-11-16
Application Number: US16219610
Application Date: 2018-12-13
Applicant: Amazon Technologies, Inc.
Inventor: Ilya Minkin , Ron Diamant , Drazen Borkovic , Jindrich Zejda , Dana Michelle Vantrease
Abstract: Integrated circuit devices and methods for synchronizing execution of program code for multiple concurrently operating execution engines of the integrated circuit devices are provided. In some cases, one execution engine of an integrated circuit device may be dependent on the operation of another execution engine of the integrated circuit device. To synchronize the execution engines around the dependency, a first execution engine may execute an instruction to set a value in a register while a second execution engine may execute an instruction to wait for a condition associated with the register value.
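The Set/Wait pairing can be pictured with a shared register: one engine stores a value, the other blocks until a condition on that value holds. A hypothetical C11 sketch, with software atomics standing in for a hardware semaphore register:

```c
#include <stdatomic.h>

/* Hypothetical shared register used to order two execution engines. */
static atomic_int sync_reg;

/* First engine: executes Set after producing whatever the dependent
 * engine is waiting on. */
void set_instruction(int value)
{
    atomic_store(&sync_reg, value);
}

/* Second engine: executes Wait and stalls until the register
 * satisfies the condition (here: reaches at least `value`), then
 * proceeds past the dependency. */
void wait_instruction(int value)
{
    while (atomic_load(&sync_reg) < value)
        ; /* busy-wait; real hardware would stall the engine */
}
```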
-
Publication Number: US11119787B1
Publication Date: 2021-09-14
Application Number: US16368263
Application Date: 2019-03-28
Applicant: Amazon Technologies, Inc.
Inventor: Mohammad El-Shabani , Ron Diamant , Samuel Jacob , Ilya Minkin , Richard John Heaton
IPC: G06F9/44 , G06F8/41 , G06F11/30 , G06F9/38 , G06F11/22 , G06F9/455 , G06F11/36 , G06F9/445 , G06F11/34 , G06F9/30
Abstract: Systems and methods for non-intrusive hardware profiling are provided. In some cases, integrated circuit devices are manufactured without native support for performance measurement or debugging, limiting visibility into the device. Understanding the timing of operations helps determine whether the hardware is operating correctly and, when it is not, provides information that can be used to debug it. To measure the execution time of various tasks performed by the integrated circuit device, program instructions may be inserted to generate notifications that provide tracing information, including timestamps, for the operations the device executes.
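A rough software analogue of the inserted instructions, assuming POSIX clock_gettime; on the actual device the notification would be written to a hardware notification queue rather than printed, and all names here are invented:

```c
#include <stdio.h>
#include <time.h>

/* Hypothetical notification record emitted by an inserted instruction. */
typedef struct {
    const char *op;    /* which operation is being traced */
    long long   ts_ns; /* timestamp in nanoseconds */
} notification_t;

static long long now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (long long)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

/* Emit one notification carrying tracing information. */
void notify(const char *op)
{
    notification_t n = { op, now_ns() };
    printf("%s @ %lld ns\n", n.op, n.ts_ns);
}

/* Usage: bracket an operation with notifications so its execution
 * time can be recovered from the trace, e.g.
 *   notify("matmul:start"); run_matmul(); notify("matmul:end");   */
```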
-
Publication Number: US10922146B1
Publication Date: 2021-02-16
Application Number: US16219530
Application Date: 2018-12-13
Applicant: Amazon Technologies, Inc.
Inventor: Ilya Minkin , Ron Diamant , Drazen Borkovic , Jindrich Zejda , Dana Michelle Vantrease
Abstract: Systems and methods are provided for synchronizing execution of program code for an integrated circuit device having multiple concurrently operating execution engines, where the operation of one execution engine may be dependent on the operation of another execution engine. Data or resource dependencies may be accommodated with a Set instruction to cause a first execution engine to set a register value and a Wait instruction to cause a second execution engine to wait for a condition associated with the register value. Concurrent operation of the execution engines may thus be synchronized.
-
Publication Number: US11983128B1
Publication Date: 2024-05-14
Application Number: US18067109
Application Date: 2022-12-16
Applicant: Amazon Technologies, Inc.
Inventor: Kun Xu , Ron Diamant , Ilya Minkin , Mohammad El-Shabani , Raymond S. Whiteside , Uday Shilton Udayaselvam
CPC classification number: G06F13/30 , G06F13/1621 , G06F13/1642
Abstract: Techniques to reduce overhead in a direct memory access (DMA) engine can include processing descriptors from a descriptor queue to obtain a striding configuration to generate tensorized memory descriptors. The striding configuration can include, for each striding dimension, a stride and a repetition number indicating a number of times to repeat striding in the corresponding striding dimension. One or more sets of tensorized memory descriptors can be generated based on the striding configuration. Data transfers are then performed based on the generated tensorized memory descriptors.
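To make the striding configuration concrete, a hypothetical two-dimensional example in C: each striding dimension contributes a stride and a repetition count, and the nested loops enumerate the addresses of the tensorized memory descriptors that one descriptor-queue entry expands into (names invented for illustration):

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical two-dimensional striding configuration: for each
 * striding dimension, a byte stride and a repetition count. */
typedef struct {
    uint64_t stride[2];
    int      reps[2];
} striding_cfg_t;

/* Enumerate the source addresses of the memory descriptors that a
 * single tensorized descriptor expands into. Returns how many were
 * generated (reps[0] * reps[1]). */
size_t gen_addresses(uint64_t base, const striding_cfg_t *c, uint64_t *out)
{
    size_t k = 0;
    for (int i = 0; i < c->reps[0]; i++)       /* outer dimension */
        for (int j = 0; j < c->reps[1]; j++)   /* inner dimension */
            out[k++] = base + (uint64_t)i * c->stride[0]
                            + (uint64_t)j * c->stride[1];
    return k;
}
```

With stride = {4096, 4} and reps = {8, 16}, for example, one tensorized descriptor would replace 128 individually fetched memory descriptors.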