Transaction ordering based on target address

    Publication Number: US12001352B1

    Publication Date: 2024-06-04

    Application Number: US17937395

    Application Date: 2022-09-30

    CPC classification number: G06F13/1621 G06F9/466

    Abstract: Techniques are provided to maintain data coherency for data transfers among data processing devices in a distributed computing environment. A data buffer in each data processing device can be mapped to an address range that is assigned to transactions that allow out-of-order completions, and a message buffer in each data processing device can be mapped to an address range that is assigned to transactions that follow transaction ordering. Thus, based on the mapping, a transaction that stores a set of data into the data buffer is completed before a transaction that writes a synchronization message into the message buffer to indicate that the set of data is stored, irrespective of the transaction ordering indicated by each transaction.
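
    A minimal Python sketch of the ordering rule described above (the buffer layout, address ranges, and names are illustrative assumptions, not the patent's actual interface): writes landing in the relaxed-ordering data range may complete out of order, while a write to the ordered message range forces all pending data writes to complete first.

```python
# Hypothetical model of address-range-based transaction ordering.
RELAXED_BASE, RELAXED_SIZE = 0x0000, 0x1000   # data buffer: out-of-order OK
ORDERED_BASE = 0x1000                         # message buffer: ordered

class Device:
    def __init__(self):
        self.memory = {}
        self.pending = []                     # relaxed writes not yet visible

    def write(self, addr, value):
        if RELAXED_BASE <= addr < RELAXED_BASE + RELAXED_SIZE:
            self.pending.append((addr, value))  # may complete out of order
        else:
            self._drain()                     # ordered write: data lands first
            self.memory[addr] = value

    def _drain(self):
        for addr, value in self.pending:
            self.memory[addr] = value
        self.pending.clear()

dev = Device()
for i, byte in enumerate(b"payload"):
    dev.write(RELAXED_BASE + i, byte)         # bulk data, relaxed ordering
dev.write(ORDERED_BASE, 1)                    # sync message: data is ready
assert dev.memory[RELAXED_BASE] == b"payload"[0]
```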

    Dilated convolution using systolic array

    Publication Number: US11816559B2

    Publication Date: 2023-11-14

    Application Number: US17832039

    Application Date: 2022-06-03

    Abstract: In one example, a non-transitory computer readable medium stores instructions that, when executed by one or more hardware processors, cause the one or more hardware processors to: load a first weight data element of an array of weight data elements from a memory into a systolic array; select a subset of input data elements from the memory to load into the systolic array for first computations of a dilated convolution operation, the subset being selected based on a rate of the dilated convolution operation and the coordinates of the first weight data element within the array of weight data elements; and control the systolic array to perform the first computations based on the first weight data element and the subset to generate first output data elements of an output data array. An example of a compiler that generates the instructions is also provided.
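
    The selection rule the abstract describes can be sketched in a few lines of Python (a 1-D toy; `rate`, `stride`, and the index arithmetic are assumptions): for the weight element at coordinate s and dilation rate d, the input feeding output position y sits at y * stride + s * d.

```python
# Toy 1-D dilated convolution, processing one weight element at a time and
# selecting the input subset from the rate and the weight's coordinate.
def dilated_conv1d(inputs, weights, rate, stride=1):
    out_len = (len(inputs) - (len(weights) - 1) * rate - 1) // stride + 1
    outputs = [0.0] * out_len
    for s, w in enumerate(weights):           # load one weight data element
        for y in range(out_len):              # its selected input subset
            outputs[y] += w * inputs[y * stride + s * rate]
    return outputs

print(dilated_conv1d([1, 2, 3, 4, 5, 6], [1, -1], rate=2))  # [-2, -2, -2, -2]
```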

    EFFICIENT UTILIZATION OF PROCESSING ELEMENT ARRAY

    Publication Number: US20230359876A1

    Publication Date: 2023-11-09

    Application Number: US18352768

    Application Date: 2023-07-14

    CPC classification number: G06N3/063 G06N3/04

    Abstract: Generating instructions for programming a processing element array to implement a convolution operation can include determining that the convolution operation under-utilizes the processing element array. The convolution operation involves using the processing element array to perform a series of matrix multiplications between a set of filters and a set of input matrices. Each filter comprises a weight matrix. Each input matrix is assigned to a respective row in the processing element array. Under-utilization can be determined by detecting that fewer than a threshold number of rows would be used concurrently. In response, instructions can be added that modify the convolution operation to increase the number of rows used concurrently, causing at least one input matrix to be processed in parallel across more rows than it would occupy without the modification.
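
    As a hedged illustration of the utilization check (the row count, threshold, and splitting policy below are invented for the example): if the number of input matrices would keep too few rows busy, each matrix is spread over several rows.

```python
# Hypothetical planner: decide how many PE rows each input matrix should span.
PE_ROWS = 128            # rows in the processing element array (assumed)
UTIL_THRESHOLD = 0.5     # flag convolutions keeping < 50% of rows busy (assumed)

def rows_per_input(num_input_matrices):
    if num_input_matrices / PE_ROWS >= UTIL_THRESHOLD:
        return 1                              # utilization is fine as-is
    # Modify the convolution: process each input matrix across more rows.
    return max(1, PE_ROWS // num_input_matrices)

print(rows_per_input(16))   # 8 -> each input matrix runs across 8 rows
```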

    PROCESSING FOR MULTIPLE INPUT DATA SETS
    Invention Publication

    Publication Number: US20230351186A1

    Publication Date: 2023-11-02

    Application Number: US18144129

    Application Date: 2023-05-05

    CPC classification number: G06N3/082 G06F3/0604 G06F3/0644 G06F3/0673 G06N3/045

    Abstract: Disclosed herein are techniques for performing multi-layer neural network processing for multiple contexts. In one embodiment, a computing engine is set in a first configuration to implement a second layer of a neural network and to process first data related to a first context to generate first context second layer output. The computing engine can be switched from the first configuration to a second configuration to implement a first layer of the neural network. The computing engine can be used to process second data related to a second context to generate second context first layer output. The computing engine can be set to a third configuration to implement a third layer of the neural network to process the first context second layer output and the second context first layer output to generate a first processing result of the first context and a second processing result of the second context.
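
    A toy Python schedule can make the configuration switches concrete (the engine model and layer functions are stand-ins, not the patent's hardware): the engine is reconfigured per layer while the two contexts' data interleaves through it.

```python
# Stand-in compute engine that is reconfigured between layers and contexts.
class Engine:
    def configure(self, layer):
        self.layer = layer                    # reprogram for the given layer
    def run(self, data):
        return f"L{self.layer}({data})"       # placeholder for real compute

engine = Engine()
engine.configure(layer=2)
ctx1_l2 = engine.run("ctx1_l1_out")           # first context, second layer
engine.configure(layer=1)
ctx2_l1 = engine.run("ctx2_input")            # second context, first layer
engine.configure(layer=3)                     # third layer serves both contexts
print(engine.run(ctx1_l2), engine.run(ctx2_l1))
```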

    Multi-memory on-chip computational network

    Publication Number: US11741345B2

    Publication Date: 2023-08-29

    Application Number: US17033573

    Application Date: 2020-09-25

    Abstract: Provided are systems, methods, and integrated circuits for a neural network processing system. In various implementations, the system can include a first array of processing engines coupled to a first set of memory banks and a second array of processing engines coupled to a second set of memory banks. The first and second sets of memory banks can store all the weight values for a neural network, with the weight values stored before any input data is received. Upon receiving input data, the system performs a task defined for the neural network. Performing the task can include computing an intermediate result using the first array of processing engines, copying the intermediate result to the second set of memory banks, and computing a final result using the second array of processing engines, where the final result corresponds to an outcome of performing the task.
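
    The dataflow reads naturally as two matrix stages with a copy in between; the sketch below (the weights, shapes, and matvec stand-in are assumptions) preloads all weights before any input arrives, as the abstract requires.

```python
# Two PE arrays modeled as matrix-vector products over preloaded weights.
def matvec(weights, vec):
    return [sum(w * x for w, x in zip(row, vec)) for row in weights]

# Preload: all weight values stored before any input data is received.
banks1 = {"weights": [[1, 0], [0, 1]]}        # first array's memory banks
banks2 = {"weights": [[2, 0], [0, 2]]}        # second array's memory banks

def run_task(inputs):
    intermediate = matvec(banks1["weights"], inputs)      # first PE array
    banks2["intermediate"] = list(intermediate)           # copy across banks
    return matvec(banks2["weights"], banks2["intermediate"])  # second array

print(run_task([3, 4]))   # [6, 8]
```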

    Multi-model training pipeline in distributed systems

    Publication Number: US11676021B1

    Publication Date: 2023-06-13

    Application Number: US17947355

    Application Date: 2022-09-19

    CPC classification number: G06N3/08 G06N3/045

    Abstract: A first worker node of a distributed system computes a first set of gradients using a first neural network model and a first set of weights associated with the first neural network model. The first set of gradients are transmitted from the first worker node to a second worker node of the distributed system. The second worker node computes a first set of synchronized gradients based on the first set of gradients. While the first set of synchronized gradients are being computed, the first worker node computes a second set of gradients using a second neural network model and a second set of weights associated with the second neural network model. The second set of gradients are transmitted from the first worker node to the second worker node. The second worker node computes a second set of synchronized gradients based on the second set of gradients.
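
    The overlap is the interesting part: synchronizing the first model's gradients proceeds while the second model's gradients are computed. A thread is used below as a stand-in for the second worker node; the gradient and all-reduce functions are placeholders.

```python
import threading

def compute_gradients(weights):               # placeholder backward pass
    return [w * 0.1 for w in weights]

def synchronize(gradients, out, key):         # placeholder all-reduce
    out[key] = [g / 2 for g in gradients]

synced = {}
grads_a = compute_gradients([1.0, 2.0])       # first model's gradients
sync_a = threading.Thread(target=synchronize, args=(grads_a, synced, "a"))
sync_a.start()                                # sync model A in background...
grads_b = compute_gradients([3.0, 4.0])       # ...while computing model B
sync_a.join()
synchronize(grads_b, synced, "b")
print(synced)   # {'a': [0.05, 0.1], 'b': [0.15, 0.2]}
```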

    Matrix transpose hardware acceleration

    Publication Number: US11636569B1

    Publication Date: 2023-04-25

    Application Number: US17029609

    Application Date: 2020-09-23

    Inventor: Kun Xu; Ron Diamant

    Abstract: In one example, an apparatus comprises: a buffer memory; and a memory access circuit configured to: fetch, from a first memory, a set of first groups of data elements of a first matrix, each first group of data elements being stored at consecutive memory addresses at the first memory; based on a first configuration, store the set of first groups of data elements at consecutive memory addresses or at non-consecutive memory addresses at the buffer memory; based on a second configuration that defines a memory address offset, fetch a set of second groups of the data elements from the buffer memory, each second group of the data elements being stored at consecutive memory addresses of the buffer memory, each second group being separated by the memory address offset in the buffer memory; and store each fetched second group at consecutive addresses of a destination memory to form a second matrix.
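
    The two-phase access pattern amounts to a strided gather; the sketch below (a flat Python list standing in for the buffer memory, with the offset set to the row length) shows how consecutive stores followed by offset-separated fetches produce the transpose.

```python
rows, cols = 2, 3
first = [[1, 2, 3], [4, 5, 6]]                # first matrix, row-major

# Phase 1: store each first group (a row) at consecutive buffer addresses.
buffer = [x for row in first for x in row]    # [1, 2, 3, 4, 5, 6]

# Phase 2: fetch second groups separated by a memory address offset, and
# store each at consecutive addresses of the destination.
offset = cols
second = [[buffer[r * offset + c] for r in range(rows)] for c in range(cols)]
print(second)   # [[1, 4], [2, 5], [3, 6]] -- the transpose
```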

    SYSTOLIC ARRAY WITH INPUT REDUCTION TO MULTIPLE REDUCED INPUTS

    Publication Number: US20230004523A1

    Publication Date: 2023-01-05

    Application Number: US17363900

    Application Date: 2021-06-30

    Abstract: Systems and methods are provided to perform multiply-accumulate operations on reduced-precision numbers in a systolic array. Each row of the systolic array can receive reduced inputs from a respective reducer. The reducer can receive a particular input and generate multiple reduced inputs from it. The reduced inputs can include reduced input data elements and/or reduced weights. The systolic array may lack support for inputs with a first bit-length, so the reducers reduce the bit-length of a given input from the first bit-length to a second, shorter bit-length and provide multiple reduced inputs with the second bit-length to the array. The systolic array may perform multiply-accumulate operations on each unique combination of the reduced input data elements and the reduced weights to generate multiple partial outputs, then sum the partial outputs to generate the output.
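
    The arithmetic behind the partial outputs is easy to verify in Python: splitting each wide operand into a high part and a low part (the 8-bit split point below is an arbitrary assumption) and multiplying every combination recovers the exact full-precision product.

```python
def split(x, bits=8):
    hi = (x >> bits) << bits                  # high-order reduced part
    lo = x - hi                               # low-order reduced part
    return hi, lo

a, w = 0x1234, 0x0567
a_hi, a_lo = split(a)                         # reduced input data elements
w_hi, w_lo = split(w)                         # reduced weights
# One multiply per unique combination of reduced inputs, then sum.
partials = [a_hi * w_hi, a_hi * w_lo, a_lo * w_hi, a_lo * w_lo]
assert sum(partials) == a * w
print(hex(sum(partials)))                     # 0x6256ec
```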
