-
Publication Number: US10572409B1
Publication Date: 2020-02-25
Application Number: US15976722
Application Date: 2018-05-10
Applicant: Xilinx, Inc.
Inventor: Jindrich Zejda , Ling Liu , Yifei Zhou , Ashish Sirasao
Abstract: A memory arrangement can store a matrix of matrix data elements specified as index-value pairs that indicate row and column indices and associated values. First split-and-merge circuitry is coupled between the memory arrangement and a first set of FIFO buffers for reading the matrix data elements from the memory arrangement and putting the matrix data elements in the first set of FIFO buffers based on column indices. A pairing circuit is configured to read vector data elements, pair the vector data elements with the matrix data elements, and put the paired matrix and vector data elements in a second set of FIFO buffers based on column indices. Second split-and-merge circuitry is configured to read paired matrix and vector data elements from the second set of FIFO buffers and put the paired matrix and vector data elements in a third set of FIFO buffers based on row indices.
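A minimal C++ sketch of the data flow this abstract describes, with the FIFO buffers modeled as std::queue and all structure and variable names invented for illustration: matrix elements stored as (row, column, value) index-value pairs are split into per-column FIFOs, paired with the vector element for their column, then merged into per-row FIFOs and accumulated.

```cpp
#include <cstddef>
#include <iostream>
#include <queue>
#include <vector>

struct MatElem { std::size_t row, col; double val; };  // index-value pair
struct Paired  { std::size_t row; double product; };   // paired matrix*vector element

int main() {
    // Sparse 3x3 matrix in (row, col, value) coordinate form.
    std::vector<MatElem> matrix = {
        {0, 0, 2.0}, {0, 2, 1.0}, {1, 1, 3.0}, {2, 0, 4.0}, {2, 2, 5.0}};
    std::vector<double> vec = {1.0, 2.0, 3.0};

    // First split-and-merge: route matrix elements into per-column FIFOs.
    std::vector<std::queue<MatElem>> colFifos(vec.size());
    for (const auto& e : matrix) colFifos[e.col].push(e);

    // Pairing: each matrix element meets the vector element for its column;
    // the second split-and-merge routes the products into per-row FIFOs.
    std::vector<std::queue<Paired>> rowFifos(3);
    for (std::size_t c = 0; c < colFifos.size(); ++c)
        while (!colFifos[c].empty()) {
            MatElem e = colFifos[c].front(); colFifos[c].pop();
            rowFifos[e.row].push({e.row, e.val * vec[c]});
        }

    // Accumulate per row to form the result vector y = A * x.
    std::vector<double> result(rowFifos.size(), 0.0);
    for (auto& fifo : rowFifos)
        while (!fifo.empty()) { result[fifo.front().row] += fifo.front().product; fifo.pop(); }

    for (double r : result) std::cout << r << '\n';    // expected: 5 6 19
}
```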
-
Publication Number: US20190114499A1
Publication Date: 2019-04-18
Application Number: US15786267
Application Date: 2017-10-17
Applicant: Xilinx, Inc.
Inventor: Elliott Delaye , Ashish Sirasao , Aaron Ng , Yongjun Wu , Jindrich Zejda
Abstract: An example preprocessor circuit for formatting image data into a plurality of streams of image samples includes: a first buffer configured to store a plurality of rows of the image data and output a row of the plurality of rows; a second buffer, coupled to the first buffer, including a plurality of storage locations to store a respective plurality of image samples of the row output by the first buffer; a plurality of shift registers; an interconnect network including a plurality of connections, each connection coupling a respective one of the plurality of shift registers to more than one of the plurality of storage locations, one or more of the plurality of storage locations being coupled to more than one of the plurality of connections; and a control circuit configured to load the plurality of shift registers with the plurality of image samples based on the plurality of connections and shift the plurality of shift registers to output the plurality of streams of image samples.
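A hedged software sketch of the buffering idea; the connection pattern, row width, and tap count below are illustrative assumptions, not taken from the patent. A row buffer feeds several tap positions so that parallel streams of image samples emerge, as a convolution front end might consume them.

```cpp
#include <array>
#include <iostream>
#include <vector>

int main() {
    constexpr int kWidth = 8, kTaps = 3;                 // row width, window size
    // "First buffer": a few rows of image data.
    std::vector<std::array<int, kWidth>> rows = {
        {1, 2, 3, 4, 5, 6, 7, 8},
        {9, 10, 11, 12, 13, 14, 15, 16}};

    for (const auto& row : rows) {                       // row emitted by the first buffer
        // "Second buffer" holds this row's samples; the interconnect feeds
        // shift register k from storage locations k, k+1, ..., so stream k
        // presents the samples a kTaps-wide sliding window sees at tap k.
        for (int k = 0; k < kTaps; ++k) {
            std::cout << "stream " << k << ':';
            for (int x = 0; x + kTaps <= kWidth; ++x)
                std::cout << ' ' << row[x + k];
            std::cout << '\n';
        }
    }
}
```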
-
Publication Number: US20190114533A1
Publication Date: 2019-04-18
Application Number: US15785679
Application Date: 2017-10-17
Applicant: Xilinx, Inc.
Inventor: Aaron Ng , Jindrich Zejda , Elliott Delaye , Xiao Teng , Sonal Santan , Soren T. Soe , Ashish Sirasao , Ehsan Ghasemi , Sean Settle
Abstract: Embodiments herein describe techniques for interfacing a neural network application with a neural network accelerator using a library. The neural network application may execute on a host computing system while the neural network accelerator executes on a massively parallel hardware system, e.g., an FPGA. The library operates a pipeline for submitting the tasks received from the neural network application to the neural network accelerator. In one embodiment, the pipeline includes a pre-processing stage, an FPGA execution stage, and a post-processing stage, each of which corresponds to a different thread. When receiving a task from the neural network application, the library generates a packet that includes the information required for the different stages in the pipeline to perform the task. Because the stages correspond to different threads, the library can process multiple packets in parallel, which can increase the utilization of the neural network accelerator on the hardware system.
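A rough illustration, not the library's actual API, of the three-stage threaded pipeline described above: pre-processing, execution, and post-processing run on separate threads connected by thread-safe queues, so several packets can be in flight at once. All names are invented.

```cpp
#include <condition_variable>
#include <iostream>
#include <mutex>
#include <optional>
#include <queue>
#include <thread>

struct Packet { int id; int data; };              // carries per-task info for all stages

template <typename T>
class Channel {                                   // minimal thread-safe FIFO
    std::queue<std::optional<T>> q_;
    std::mutex m_;
    std::condition_variable cv_;
public:
    void push(std::optional<T> v) {
        { std::lock_guard<std::mutex> l(m_); q_.push(std::move(v)); }
        cv_.notify_one();
    }
    std::optional<T> pop() {                      // blocks until an item arrives
        std::unique_lock<std::mutex> l(m_);
        cv_.wait(l, [&] { return !q_.empty(); });
        auto v = std::move(q_.front()); q_.pop();
        return v;
    }
};

int main() {
    Channel<Packet> toExec, toPost;
    std::thread exec([&] {                        // "FPGA execution" stage
        while (auto p = toExec.pop()) { p->data *= 2; toPost.push(*p); }
        toPost.push(std::nullopt);                // forward the shutdown sentinel
    });
    std::thread post([&] {                        // post-processing stage
        while (auto p = toPost.pop())
            std::cout << "task " << p->id << " -> " << p->data << '\n';
    });
    for (int i = 0; i < 4; ++i)                   // pre-processing stage (main thread)
        toExec.push(Packet{i, i + 10});
    toExec.push(std::nullopt);                    // no more tasks
    exec.join();
    post.join();
}
```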
-
Publication Number: US09842187B1
Publication Date: 2017-12-12
Application Number: US15082993
Application Date: 2016-03-28
Applicant: Xilinx, Inc.
Inventor: Jindrich Zejda , Atul Srinivasan , Ilya K. Ganusov , Walter A. Manaker, Jr. , Benjamin S. Devlin , Satish B. Sivaswamy
IPC: G06F17/50
CPC classification number: G06F17/5081 , G06F17/5054 , G06F2217/84
Abstract: Approaches for processing a circuit design include determining pin slack values for pins of the circuit elements in the circuit design. A processor selects a subset of endpoints based on pin slack values of the endpoints being in a critical slack range and determines startpoints of the circuit design that are in respective critical fanin cones. For each endpoint of the subset, the processor determines an arrival time from each startpoint in the respective critical fanin cone and determines for each startpoint-endpoint pair, a respective set of constraint values as a function of the respective arrival time from the startpoint. The processor generates a graph in the memory circuit from the startpoint-endpoint pairs. First nodes in the graph represent the startpoints and second nodes in the graph represent the endpoints, and values in the respective set of constraint values are associated with edges that connect the nodes.
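A toy software rendering of the described flow, with an assumed slack threshold and a placeholder constraint function: endpoints whose slack falls in the critical range are selected, and each startpoint-endpoint edge is annotated with a constraint value derived from the arrival time.

```cpp
#include <iostream>
#include <map>
#include <string>
#include <utility>
#include <vector>

struct Endpoint { std::string name; double slack; };

int main() {
    const double criticalSlack = 0.5;                       // assumed threshold (ns)
    std::vector<Endpoint> endpoints = {{"ep0", 0.2}, {"ep1", 1.4}, {"ep2", -0.1}};
    // Arrival time at each endpoint from each startpoint in its critical fanin cone.
    std::map<std::pair<std::string, std::string>, double> arrival = {
        {{"sp0", "ep0"}, 3.1}, {{"sp1", "ep0"}, 2.7}, {{"sp0", "ep2"}, 3.6}};

    // Graph edges: (startpoint, endpoint) -> constraint value on that edge.
    std::map<std::pair<std::string, std::string>, double> edges;
    for (const auto& ep : endpoints) {
        if (ep.slack > criticalSlack) continue;             // outside critical slack range
        for (const auto& [key, t] : arrival)
            if (key.second == ep.name)
                edges[key] = t + ep.slack;                  // placeholder constraint function
    }
    for (const auto& [key, c] : edges)
        std::cout << key.first << " -> " << key.second << " : " << c << '\n';
}
```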
-
Publication Number: US12086572B1
Publication Date: 2024-09-10
Application Number: US15786452
Application Date: 2017-10-17
Applicant: Xilinx, Inc.
Inventor: Yongjun Wu , Jindrich Zejda , Elliott Delaye , Ashish Sirasao
CPC classification number: G06F8/313 , G06F8/47 , G06F12/0646 , G06N3/04 , G06N20/00
Abstract: Embodiments herein describe techniques for expressing the layers of a neural network in a software model. In one embodiment, the software model includes a class that describes the various functional blocks (e.g., convolution units, max-pooling units, rectified linear units (ReLU), and scaling functions) used to execute the neural network layers. In turn, other classes in the software model can describe the operation of each of the functional blocks. In addition, the software model can include conditional logic for expressing how the data flows between the functional blocks since different layers in the neural network can process the data differently. A compiler can convert the high-level code in the software model (e.g., C++) into a hardware description language (e.g., register transfer level (RTL)) which is used to configure a hardware system to implement a neural network accelerator.
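A small sketch of the software-model style the abstract suggests, with invented class and member names: one class per functional block, and per-layer conditional logic steering data between blocks. An HLS compiler could translate code written in this style into RTL; the arithmetic here is a stand-in.

```cpp
#include <algorithm>
#include <cstddef>
#include <iostream>
#include <vector>

using Tensor = std::vector<float>;

// One class per functional block, as the abstract's software model suggests.
struct Conv {                                     // stand-in convolution (scales by 0.5)
    Tensor run(const Tensor& x) { Tensor y(x); for (auto& v : y) v *= 0.5f; return y; }
};
struct Relu {                                     // rectified linear unit
    Tensor run(const Tensor& x) { Tensor y(x); for (auto& v : y) v = std::max(v, 0.0f); return y; }
};
struct MaxPool {                                  // 1-D max pooling, window of 2
    Tensor run(const Tensor& x) {
        Tensor y;
        for (std::size_t i = 0; i + 1 < x.size(); i += 2) y.push_back(std::max(x[i], x[i + 1]));
        return y;
    }
};

struct LayerConfig { bool useRelu; bool usePool; };  // drives the conditional dataflow

int main() {
    Conv conv; Relu relu; MaxPool pool;
    std::vector<LayerConfig> layers = {{true, false}, {true, true}};
    Tensor data = {-2.0f, 4.0f, 6.0f, -8.0f};
    for (const auto& cfg : layers) {              // different layers route data differently
        data = conv.run(data);
        if (cfg.useRelu) data = relu.run(data);
        if (cfg.usePool) data = pool.run(data);
    }
    for (float v : data) std::cout << v << ' ';
    std::cout << '\n';
}
```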
-
Publication Number: US12061990B2
Publication Date: 2024-08-13
Application Number: US15786434
Application Date: 2017-10-17
Applicant: Xilinx, Inc.
Inventor: Yongjun Wu , Jindrich Zejda , Elliott Delaye , Ashish Sirasao
Abstract: Embodiments herein describe techniques for statically scheduling a neural network implemented in a massively parallel hardware system. The neural network may be scheduled using three different scheduling levels referred to herein as an upper level, an intermediate level, and a lower level. In one embodiment, the upper level includes a hardware or software model of the layers in the neural network that establishes a sequential order of functions that operate concurrently in the hardware system. In the intermediate level, identical processes in the functions defined in the upper level are connected to form a systolic array or mesh, and balanced data flow channels are used to minimize latency. In the lower level, a compiler can assign the operations performed by the processing elements in the systolic array to different portions of the hardware system to provide a static schedule for the neural network.
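A toy illustration of a lower-level static schedule; the PE count, latency, and round-robin policy are assumptions, not the patent's method. Operations are assigned to processing elements with fixed start cycles, so the whole schedule is known at compile time.

```cpp
#include <cstddef>
#include <iostream>
#include <string>
#include <vector>

struct Slot { std::string op; int pe; int startCycle; };

int main() {
    const int numPEs = 2, latency = 3;               // assumed PE count and op latency
    std::vector<std::string> ops = {"mul0", "mul1", "add0", "mul2", "add1"};
    std::vector<Slot> schedule;
    std::vector<int> nextFree(numPEs, 0);            // next free cycle per PE
    for (std::size_t i = 0; i < ops.size(); ++i) {
        int pe = static_cast<int>(i % numPEs);       // round-robin PE assignment
        schedule.push_back({ops[i], pe, nextFree[pe]});
        nextFree[pe] += latency;                     // PE is busy for `latency` cycles
    }
    for (const auto& s : schedule)
        std::cout << s.op << " -> PE" << s.pe << " @ cycle " << s.startCycle << '\n';
}
```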
-
Publication Number: US11694066B2
Publication Date: 2023-07-04
Application Number: US15785679
Application Date: 2017-10-17
Applicant: Xilinx, Inc.
Inventor: Aaron Ng , Jindrich Zejda , Elliott Delaye , Xiao Teng , Sonal Santan , Soren T. Soe , Ashish Sirasao , Ehsan Ghasemi , Sean Settle
Abstract: Embodiments herein describe techniques for interfacing a neural network application with a neural network accelerator using a library. The neural network application may execute on a host computing system while the neural network accelerator executes on a massively parallel hardware system, e.g., an FPGA. The library operates a pipeline for submitting the tasks received from the neural network application to the neural network accelerator. In one embodiment, the pipeline includes a pre-processing stage, an FPGA execution stage, and a post-processing stage, each of which corresponds to a different thread. When receiving a task from the neural network application, the library generates a packet that includes the information required for the different stages in the pipeline to perform the task. Because the stages correspond to different threads, the library can process multiple packets in parallel, which can increase the utilization of the neural network accelerator on the hardware system.
-
Publication Number: US11036827B1
Publication Date: 2021-06-15
Application Number: US15786346
Application Date: 2017-10-17
Applicant: Xilinx, Inc.
Inventor: Jindrich Zejda , Elliott Delaye , Yongjun Wu , Aaron Ng , Ashish Sirasao , Khang K. Dao
Abstract: Methods and apparatus are described for simultaneously buffering and reformatting (e.g., transposing) a matrix for high-speed data streaming in general matrix multiplication (GEMM), which may be implemented by a programmable integrated circuit (IC). Examples of the present disclosure increase the effective double data rate (DDR) memory throughput for streaming data into a GEMM digital signal processing (DSP) engine multifold, as well as eliminate slow data reformatting on a host central processing unit (CPU). This may be accomplished through software-defined (e.g., C++) data structures and access patterns that result in hardware logic that simultaneously buffers and reorganizes the data to achieve linear DDR addressing.
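A sketch of the buffer-and-reformat idea under assumed matrix and tile sizes; the actual hardware layout is not specified here. A row-major matrix is rewritten into contiguous column-major tiles so a streaming engine can read each tile with purely linear addresses instead of strided DDR accesses.

```cpp
#include <iostream>
#include <vector>

int main() {
    constexpr int kRows = 4, kCols = 4, kTile = 2;
    std::vector<int> src(kRows * kCols);                   // row-major source matrix
    for (int i = 0; i < kRows * kCols; ++i) src[i] = i;

    std::vector<int> dst;                                  // tiles laid out back-to-back
    dst.reserve(src.size());
    for (int tr = 0; tr < kRows; tr += kTile)
        for (int tc = 0; tc < kCols; tc += kTile)
            for (int c = tc; c < tc + kTile; ++c)          // column-major inside a tile
                for (int r = tr; r < tr + kTile; ++r)
                    dst.push_back(src[r * kCols + c]);     // transpose while buffering

    for (int v : dst) std::cout << v << ' ';               // engine reads dst linearly
    std::cout << '\n';
}
```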
-
Publication Number: US10943039B1
Publication Date: 2021-03-09
Application Number: US15786105
Application Date: 2017-10-17
Applicant: Xilinx, Inc.
Inventor: Ashish Sirasao , Elliott Delaye , Sean Settle , Zhao Ma , Ehsan Ghasemi , Xiao Teng , Aaron Ng , Jindrich Zejda
IPC: G06F30/327 , G06F7/544 , G06N3/04 , G06F30/34
Abstract: An example multiply-accumulate (MACC) circuit includes: a multiply-accumulator having an accumulator output register; a quantizer coupled to the multiply-accumulator; and a control circuit coupled to the multiply-accumulator and the quantizer, the control circuit configured to provide control data to the quantizer, the control data indicative of a most-significant bit (MSB) to least-significant bit (LSB) range for selecting bit indices from the accumulator output register.
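A toy model of the quantizer step, omitting rounding and saturation: the control data names an MSB-to-LSB bit window, and the quantizer keeps only those bit indices of the accumulator output, i.e., a right shift plus a mask.

```cpp
#include <cstdint>
#include <iostream>

// Keep bit indices [msb:lsb] (inclusive) of the accumulator output register.
uint32_t quantize(uint64_t acc, unsigned msb, unsigned lsb) {
    unsigned width = msb - lsb + 1;
    uint64_t mask = (width >= 64) ? ~0ull : ((1ull << width) - 1);
    return static_cast<uint32_t>((acc >> lsb) & mask);
}

int main() {
    uint64_t acc = 0;
    const int a[] = {3, 5, 7}, b[] = {2, 4, 6};
    for (int i = 0; i < 3; ++i)
        acc += static_cast<uint64_t>(a[i]) * static_cast<uint64_t>(b[i]);  // multiply-accumulate
    std::cout << "acc = " << acc << '\n';                      // 3*2 + 5*4 + 7*6 = 68
    std::cout << "q[7:2] = " << quantize(acc, 7, 2) << '\n';   // (68 >> 2) & 0x3f = 17
}
```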
-
Publication Number: US20190114535A1
Publication Date: 2019-04-18
Application Number: US15786288
Application Date: 2017-10-17
Applicant: Xilinx, Inc.
Inventor: Aaron Ng , Jindrich Zejda , Elliott Delaye , Xiao Teng , Ashish Sirasao
Abstract: A disclosed neural network processing system includes a host computer system, RAMs coupled to the host computer system, and neural network accelerators coupled to the RAMs, respectively. The host computer system is configured with software that, when executed, causes the host computer system to write input data and work requests to the RAMs. Each work request specifies a subset of neural network operations to perform and memory locations in a RAM of the input data and parameters. A graph of dependencies among neural network operations is built and additional dependencies added. The operations are partitioned into coarse-grain tasks and fine-grain subtasks for optimal scheduling for parallel execution. The subtasks are scheduled to accelerator kernels of matching capabilities. Each neural network accelerator is configured to read a work request from the respective RAM and perform the subset of neural network operations on the input data using the parameters.
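An illustrative host-side sketch with invented field names and addresses: work requests name an operation and buffer locations, and a dependency graph is ordered with Kahn's algorithm so that subtasks whose dependencies are satisfied can be dispatched in parallel to accelerators.

```cpp
#include <cstddef>
#include <iostream>
#include <queue>
#include <string>
#include <vector>

struct WorkRequest { std::string op; int inputAddr; int paramAddr; };  // invented fields

int main() {
    std::vector<WorkRequest> subtasks = {
        {"conv1", 0x0000, 0x1000}, {"conv2", 0x0400, 0x1400},
        {"pool", 0x0800, 0x1800}, {"fc", 0x0c00, 0x1c00}};
    // deps[i] lists subtasks that must finish before subtask i may start.
    std::vector<std::vector<int>> deps = {{}, {}, {0, 1}, {2}};

    // Kahn's algorithm: dispatch subtasks as their dependencies are satisfied;
    // everything popped from `ready` in the same wave could run in parallel.
    std::vector<int> remaining(subtasks.size(), 0);
    std::vector<std::vector<int>> dependents(subtasks.size());
    for (std::size_t i = 0; i < deps.size(); ++i) {
        remaining[i] = static_cast<int>(deps[i].size());
        for (int d : deps[i]) dependents[d].push_back(static_cast<int>(i));
    }
    std::queue<int> ready;
    for (std::size_t i = 0; i < remaining.size(); ++i)
        if (remaining[i] == 0) ready.push(static_cast<int>(i));
    while (!ready.empty()) {
        int t = ready.front(); ready.pop();
        std::cout << "dispatch " << subtasks[t].op << " (input @ 0x"
                  << std::hex << subtasks[t].inputAddr << std::dec << ")\n";
        for (int d : dependents[t])
            if (--remaining[d] == 0) ready.push(d);
    }
}
```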
-