Loop optimization for implementing circuit designs in hardware

    Publication No.: US10331836B1

    Publication date: 2019-06-25

    Application No.: US15730431

    Filing date: 2017-10-11

    Applicant: Xilinx, Inc.

    Abstract: Implementing a circuit design can include determining a chain of a plurality of loop elements of a circuit design, wherein each loop element includes a bit select node configured to perform a bit assignment operation and a corresponding address calculation node, wherein the address calculation nodes use a common variable to calculate a starting bit location provided to the corresponding bit select node. In response to the determining, the chain is replicated resulting in one chain for each value of the common variable and transforming each chain into a plurality of wires. A multiplexer is inserted into the circuit design. The plurality of wires for each chain is coupled to inputs of the multiplexer and the common variable is provided to the multiplexer as a select signal.
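The transformation in this abstract can be illustrated with a minimal behavioral sketch in Python (not HDL, and not the patent's own implementation). It assumes a fixed field width and an address calculation of the form `idx * WIDTH`; the names `dynamic_select`, `mux_select`, and `WIDTH` are hypothetical.

```python
WIDTH = 4  # bits per selected field (assumed for illustration)
MASK = (1 << WIDTH) - 1


def dynamic_select(data: int, idx: int) -> int:
    """Original form: an address calculation feeds a variable bit select."""
    start = idx * WIDTH  # starting bit location computed from the common variable
    return (data >> start) & MASK


def mux_select(data: int, idx: int, num_values: int) -> int:
    """Transformed form: one constant slice per value of idx, then a multiplexer."""
    # Each entry models the wires obtained by replicating the chain for one
    # fixed value of the common variable, so the slice position is a constant.
    wires = [(data >> (v * WIDTH)) & MASK for v in range(num_values)]
    # The common variable now acts only as the multiplexer select signal.
    return wires[idx]
```

Both functions return the same value for every legal `idx`; the second form replaces the run-time shift with fixed wiring plus a selection, which is the general idea the abstract describes.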

    MACHINE LEARNING RUNTIME LIBRARY FOR NEURAL NETWORK ACCELERATION

    Publication No.: US20190114533A1

    Publication date: 2019-04-18

    Application No.: US15785679

    Filing date: 2017-10-17

    Applicant: Xilinx, Inc.

    Abstract: Embodiments herein describe techniques for interfacing a neural network application with a neural network accelerator using a library. The neural network application may execute on a host computing system while the neural network accelerator executes on a massively parallel hardware system, e.g., an FPGA. The library operates a pipeline for submitting the tasks received from the neural network application to the neural network accelerator. In one embodiment, the pipeline includes a pre-processing stage, an FPGA execution stage, and a post-processing stage, each of which corresponds to a different thread. When receiving a task from the neural network application, the library generates a packet that includes the information required for the different stages in the pipeline to perform the task. Because the stages correspond to different threads, the library can process multiple packets in parallel, which can increase the utilization of the neural network accelerator on the hardware system.
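A three-stage, thread-per-stage pipeline of the kind this abstract describes might look roughly like the following Python sketch. It is not the library's API: the packet fields (`raw`, `input`, `output`, `result`), the stage functions, and `run_pipeline` are all hypothetical, and the "execution" stage is a plain software stand-in for the accelerator.

```python
import queue
import threading

_STOP = object()  # sentinel passed down the pipeline to shut the stages down


def _stage(work, inbox, outbox):
    """Run one pipeline stage in its own thread until the sentinel arrives."""
    while True:
        packet = inbox.get()
        if packet is _STOP:
            outbox.put(_STOP)  # forward the shutdown signal to the next stage
            return
        outbox.put(work(packet))


def pre_process(packet):
    packet["input"] = [x / 255.0 for x in packet["raw"]]
    return packet


def execute(packet):
    # Stand-in for the accelerator call (assumption, no real hardware involved).
    packet["output"] = sum(packet["input"])
    return packet


def post_process(packet):
    packet["result"] = round(packet["output"], 3)
    return packet


def run_pipeline(tasks):
    """Wrap each task in a packet and push it through the three stages."""
    stages = [pre_process, execute, post_process]
    queues = [queue.Queue() for _ in range(len(stages) + 1)]
    threads = [
        threading.Thread(target=_stage, args=(fn, queues[i], queues[i + 1]))
        for i, fn in enumerate(stages)
    ]
    for t in threads:
        t.start()
    for task_id, raw in enumerate(tasks):
        queues[0].put({"id": task_id, "raw": raw})
    queues[0].put(_STOP)
    results = []
    while True:
        packet = queues[-1].get()
        if packet is _STOP:
            break
        results.append(packet)
    for t in threads:
        t.join()
    return results
```

Because each stage has its own thread and queue, one packet can be in pre-processing while an earlier one is in the execution stage, which is the overlap the abstract credits for higher accelerator utilization.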

    Folding duplicate instances of modules in a circuit design

    Publication No.: US09875330B2

    Publication date: 2018-01-23

    Application No.: US14960176

    Filing date: 2015-12-04

    Applicant: Xilinx, Inc.

    CPC classification number: G06F17/5072 G06F17/5045 G06F17/505 G06F17/5054

    Abstract: Disclosed approaches for processing a circuit design include identifying duplicate instances of a module in a representation of the circuit design. A processor circuit performs folding operations for at least one pair of the duplicate instances of the module. One instance of the duplicates is removed from the circuit design, and a multiplexer is inserted. The multiplexer receives and selects one of the input signals to the duplicate instances and provides the selected input signal to the remaining instance. For each flip-flop in the remaining instance, a pipelined flip-flop is inserted. Connections to a first clock signal in the remaining instance are replaced with connections to a second clock signal having twice the frequency of the first clock signal. An alignment circuit is inserted to receive the output signal from the first instance and provide concurrent first and second output signals.
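The folding idea can be modeled behaviorally in Python. The sketch below is an illustration under simplifying assumptions, not the patented flow: the duplicated module is taken to be a simple accumulator, and the class names `Accumulator` and `FoldedAccumulator` are hypothetical.

```python
class Accumulator:
    """The module that appears twice in the unfolded design."""

    def __init__(self):
        self.state = 0

    def step(self, x):
        self.state += x
        return self.state


class FoldedAccumulator:
    """One shared instance serving both input streams at twice the clock rate."""

    def __init__(self):
        self.ff = 0    # the original flip-flop
        self.pipe = 0  # the inserted pipelined flip-flop

    def _fast_tick(self, x):
        # On each fast clock edge the logic sees the state written two fast
        # ticks earlier, i.e. the state that belongs to the same stream.
        new_state = self.pipe + x
        self.pipe = self.ff
        self.ff = new_state
        return new_state

    def step(self, a, b):
        """One slow clock cycle: the multiplexer feeds a, then b."""
        out_a = self._fast_tick(a)
        out_b = self._fast_tick(b)
        # Returning both values together models the alignment circuit, which
        # presents the two results concurrently.
        return out_a, out_b
```

Feeding the same two input streams to two separate `Accumulator` objects and to one `FoldedAccumulator` yields identical output pairs, which is the equivalence that folding relies on.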

    Neural network processing system having host controlled kernel acclerators

    Publication No.: US11568218B2

    Publication date: 2023-01-31

    Application No.: US15786288

    Filing date: 2017-10-17

    Applicant: Xilinx, Inc.

    Abstract: A disclosed neural network processing system includes a host computer system, RAMs coupled to the host computer system, and neural network accelerators coupled to the RAMs, respectively. The host computer system is configured with software that when executed causes the host computer system to write input data and work requests to the RAMs. Each work request specifies a subset of neural network operations to perform and memory locations in a RAM of the input data and parameters. A graph of dependencies among neural network operations is built and additional dependencies added. The operations are partitioned into coarse grain tasks and fine grain subtasks for optimal scheduling for parallel execution. The subtasks are scheduled to accelerator kernels of matching capabilities. Each neural network accelerator is configured to read a work request from the respective RAM and perform the subset of neural network operations on the input data using the parameters.

    Image preprocessing for generalized image processing

    Publication No.: US11386644B2

    Publication date: 2022-07-12

    Application No.: US15786267

    Filing date: 2017-10-17

    Applicant: Xilinx, Inc.

    Abstract: An example preprocessor circuit includes: a first buffer configured to store rows of image data and output a row thereof; a second buffer, coupled to the first buffer, including storage locations to store respective image samples of the row output by the first buffer; shift registers; an interconnect network including connections, each connection coupling a respective one of the shift registers to more than one of the storage locations, one or more of the storage locations being coupled to more than one of the connections; and a control circuit configured to load the shift registers with the image samples based on the connections and shift the shift registers to output streams of image samples.

    Circuit arrangements and methods for traversing input feature maps

    Publication No.: US11106968B1

    Publication date: 2021-08-31

    Application No.: US15989075

    Filing date: 2018-05-24

    Applicant: Xilinx, Inc.

    Abstract: A circuit arrangement includes a buffer, a height traversal circuit configured to generate a sequence of IFM height values in response to first control signals, a width traversal circuit configured to generate a sequence of IFM width values in response to second control signals, a control circuit, and an address generation circuit. The control circuit is configured to input an OFM height, an OFM width, a kernel height, and a kernel width; generate the first control signals at times based on the OFM height and the kernel height; and generate the second control signals at times based on the OFM width and the kernel width. The address generation circuit is configured to generate a sequence of addresses based on the sequences of IFM height values and IFM width values, provide the sequence of addresses to the buffer, and enable reading from the buffer.
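As a rough software analogue of the address generation described here, the Python sketch below derives buffer addresses from the OFM and kernel dimensions. Several details are assumptions not stated in the abstract: a stride parameter, no padding, a row-major IFM buffer, and the particular loop order. The function name `ifm_addresses` is hypothetical.

```python
def ifm_addresses(ofm_h, ofm_w, k_h, k_w, stride=1):
    """Return IFM buffer addresses in one possible traversal order."""
    # IFM width implied by the OFM width when there is no padding (assumed).
    ifm_w = (ofm_w - 1) * stride + k_w
    addresses = []
    for oh in range(ofm_h):              # OFM rows
        for ow in range(ofm_w):          # OFM columns
            for kh in range(k_h):        # kernel rows
                for kw in range(k_w):    # kernel columns
                    h = oh * stride + kh  # IFM height value
                    w = ow * stride + kw  # IFM width value
                    addresses.append(h * ifm_w + w)  # row-major address
    return addresses
```

For a 2x2 OFM and a 2x2 kernel, the implied IFM is 3x3, and the first output position reads addresses 0, 1, 3, and 4. In hardware, the height and width sequences would come from separate traversal circuits driven by control signals rather than from nested loops.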

    Systems for optimization of read-only memory (ROM)

    Publication No.: US10726175B1

    Publication date: 2020-07-28

    Application No.: US16291952

    Filing date: 2019-03-04

    Applicant: Xilinx, Inc.

    Abstract: A memory optimization method includes identifying, within a circuit design, a memory having an arithmetic operator at an output side and/or an input side of the memory. The memory may include a read-only memory (ROM). In some examples, an input of the arithmetic operator includes a constant value. In some embodiments, the memory optimization method further includes absorbing a function of the arithmetic operator into the memory. By way of example, the absorbing the function includes modifying contents of the memory based on the function of the arithmetic operator to provide an updated memory and removing the arithmetic operator from the circuit design.
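Absorbing a constant arithmetic operator into ROM contents can be sketched in Python as follows. This is an illustration only: it assumes the operator is an adder with one constant input, that output values wrap to a fixed bit width, and that addresses wrap modulo the ROM depth. The function names are hypothetical.

```python
def absorb_output_add(rom, constant, width):
    """Fold 'rom[addr] + constant' into the ROM contents (output-side adder)."""
    mask = (1 << width) - 1  # results wrap to the data width (assumed)
    return [(value + constant) & mask for value in rom]


def absorb_input_add(rom, constant):
    """Fold 'rom[addr + constant]' into the ROM contents (input-side adder)."""
    depth = len(rom)  # addresses wrap modulo the ROM depth (assumed)
    return [rom[(addr + constant) % depth] for addr in range(depth)]
```

After either rewrite, a plain lookup in the updated ROM returns what the original ROM plus adder produced, so the adder can be removed from the design.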

    Sparse matrix processing circuitry

    Publication No.: US10572409B1

    Publication date: 2020-02-25

    Application No.: US15976722

    Filing date: 2018-05-10

    Applicant: Xilinx, Inc.

    Abstract: A memory arrangement can store a matrix of matrix data elements specified as index-value pairs that indicate row and column indices and associated values. First split-and-merge circuitry is coupled between the memory arrangement and a first set of FIFO buffers for reading the matrix data elements from the memory arrangement and putting the matrix data elements in the first set of FIFO buffers based on column indices. A pairing circuit is configured to read vector data elements, pair the vector data elements with the matrix data elements, and put the paired matrix and vector data elements in a second set of FIFO buffers based on column indices. Second split-and-merge circuitry is configured to read paired matrix and vector data elements from the second set of FIFO buffers and put the paired matrix and vector data elements in a third set of FIFO buffers based on row indices.
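The data movement in this abstract can be approximated in software. The Python sketch below routes index-value pairs through three sets of FIFOs and then accumulates a sparse matrix-vector product. The final accumulation step, the modulo bank-selection rule, and the name `spmv_banked` are assumptions for illustration; the abstract itself only describes the routing.

```python
from collections import deque


def spmv_banked(elements, x, num_rows, num_banks=2):
    """Compute y = A @ x for A given as (row, col, value) triples."""
    # First set of FIFOs: matrix elements distributed by column index.
    first = [deque() for _ in range(num_banks)]
    for row, col, val in elements:
        first[col % num_banks].append((row, col, val))

    # Pairing: attach the vector element for each column; the pair stays in
    # the bank chosen by column index (second set of FIFOs).
    second = [deque() for _ in range(num_banks)]
    for bank, fifo in enumerate(first):
        while fifo:
            row, col, val = fifo.popleft()
            second[bank].append((row, val, x[col]))

    # Second split-and-merge: redistribute the pairs by row index.
    third = [deque() for _ in range(num_banks)]
    for fifo in second:
        while fifo:
            row, val, x_val = fifo.popleft()
            third[row % num_banks].append((row, val, x_val))

    # Assumed downstream step: multiply and accumulate per row.
    y = [0] * num_rows
    for fifo in third:
        while fifo:
            row, val, x_val = fifo.popleft()
            y[row] += val * x_val
    return y
```

Routing by column first keeps each vector element next to the matrix entries that need it, while routing by row afterwards groups the products that belong to the same output element.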
