Memory access operation in distributed computing system

    Publication No.: US11467992B1

    Publication Date: 2022-10-11

    Application No.: US17031668

    Application Date: 2020-09-24

    Abstract: In one example, an apparatus comprises: a local on-chip memory; a computation engine configured to generate local data and to store the local data at the local on-chip memory; and a controller. The apparatus is configured to be coupled with a second device via an interconnect, the second device comprising a local memory. The controller is configured to: fetch the local data from the local on-chip memory; fetch remote data generated by another device from a local off-chip memory; generate output data based on combining the local data and the remote data; and store, via the interconnect, the output data at the local memory of the second device.
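The combine-and-forward step the abstract describes can be sketched as follows. This is an illustrative model only, not the patented hardware: the function names, the list-based "memories", and the element-wise sum as the combining operation are all assumptions.

```python
def combine_and_forward(local_mem, remote_mem, next_device_mem, n):
    # Fetch locally generated data (on-chip) and remote data (off-chip).
    local = [local_mem[i] for i in range(n)]
    remote = [remote_mem[i] for i in range(n)]
    # Combine the two; an element-wise sum is assumed here, as in a
    # reduce step of distributed training.
    output = [a + b for a, b in zip(local, remote)]
    # Write the output over the interconnect into the second device's
    # local memory, modeled as a plain list.
    for i, v in enumerate(output):
        next_device_mem[i] = v
    return output
```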

    DILATED CONVOLUTION USING SYSTOLIC ARRAY

    Publication No.: US20220292163A1

    Publication Date: 2022-09-15

    Application No.: US17832039

    Application Date: 2022-06-03

Abstract: In one example, a non-transitory computer readable medium stores instructions that, when executed by one or more hardware processors, cause the one or more hardware processors to: load a first weight data element of an array of weight data elements from a memory into a systolic array; load a subset of input data elements from the memory into the systolic array to perform first computations of a dilated convolution operation, the subset being selected based on a rate of the dilated convolution operation and coordinates of the first weight data element within the array of weight data elements; and control the systolic array to perform the first computations based on the first weight data element and the subset to generate first output data elements of an output data array. An example of a compiler that generates the instructions is also provided.
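A minimal one-dimensional sketch of the input-selection rule, assuming the conventional indexing for dilated convolution (the function names and the stride parameter are illustrative, not from the patent): for weight element k at dilation rate r, output position o reads input o*stride + k*r.

```python
def select_inputs_for_weight(inputs, k, rate, num_outputs, stride=1):
    # For weight element k of a dilated convolution with the given rate,
    # output position o is fed by inputs[o*stride + k*rate].
    return [inputs[o * stride + k * rate] for o in range(num_outputs)]

def dilated_conv1d(inputs, weights, rate):
    num_outputs = len(inputs) - (len(weights) - 1) * rate
    outputs = [0] * num_outputs
    # One pass per weight element, mirroring loading one weight into the
    # systolic array and streaming the matching input subset through it.
    for k, w in enumerate(weights):
        subset = select_inputs_for_weight(inputs, k, rate, num_outputs)
        for o, x in enumerate(subset):
            outputs[o] += w * x
    return outputs
```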

    Transpose operations using processing element array

    Publication No.: US11347480B2

    Publication Date: 2022-05-31

    Application No.: US17122136

    Application Date: 2020-12-15

Abstract: Provided are integrated circuits and methods for transposing a tensor using processing element array operations. In some cases, it may be necessary to transpose elements of a tensor to perform a matrix operation. The tensor may be decomposed into blocks of data elements having dimensions consistent with the dimensions of a systolic array. An identity multiplication may be performed on each block of data elements loaded into a systolic array and the multiplication products summed in column partitions of a results buffer. The data elements in the column partitions of the results buffer can then be mapped to row partitions of a buffer memory for further processing.
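The block-wise scheme can be sketched in NumPy, with a literal identity matrix multiplication standing in for the systolic-array pass and the final column-to-row remapping done by writing each product's transpose into the destination. All names here are illustrative assumptions.

```python
import numpy as np

def transpose_via_identity(tensor, block):
    # Decompose the tensor into block x block tiles sized to the systolic
    # array, run each tile through an "identity multiplication" (I @ tile
    # here stands in for the array), and map the product's columns to rows
    # of the destination buffer.
    n, m = tensor.shape
    out = np.empty((m, n), dtype=tensor.dtype)
    I = np.eye(block, dtype=tensor.dtype)
    for r in range(0, n, block):
        for c in range(0, m, block):
            tile = tensor[r:r + block, c:c + block]
            # Edge tiles may be smaller than the full block.
            result = I[:tile.shape[0], :tile.shape[0]] @ tile
            # Column partitions of the result become row partitions
            # of the output buffer.
            out[c:c + tile.shape[1], r:r + tile.shape[0]] = result.T
    return out
```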

    Target port with distributed transactions

    Publication No.: US11138106B1

    Publication Date: 2021-10-05

    Application No.: US16836780

    Application Date: 2020-03-31

Abstract: Provided are integrated circuit devices and methods for operating integrated circuit devices. In various examples, the integrated circuit device can include a target port operable to receive transactions from a master port. The target port can be configured with a multicast address range that is associated with a plurality of indices corresponding to memory banks of the device. When the target port receives a write transaction whose address is within the multicast address range, the target port can determine an index from the plurality of indices and an offset value from the address, and can combine the index and the offset value to form a second address. The target port can then use the second address to write the data to the memory.
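A sketch of the address rewriting, under the assumption of a linear bank layout (each bank's base address is index * bank_stride); the function name and parameters are illustrative, not the patent's terminology.

```python
def route_write_address(address, multicast_base, multicast_size,
                        bank_indices, bank_stride):
    # Addresses inside the multicast range fan out to one second address
    # per bank index: each combines the bank base (index * bank_stride)
    # with the offset of the original address within the range.
    if not (multicast_base <= address < multicast_base + multicast_size):
        return [address]  # outside the range: ordinary unicast write
    offset = address - multicast_base
    return [idx * bank_stride + offset for idx in bank_indices]
```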

    GRADIENT COMPRESSION FOR DISTRIBUTED TRAINING

    Publication No.: US20210295168A1

    Publication Date: 2021-09-23

    Application No.: US16827444

    Application Date: 2020-03-23

Inventors: Kun Xu; Ron Diamant

Abstract: Techniques for exchanging compressed gradient data within a distributed system are disclosed. A set of gradients is computed at a first worker node of the distributed system using a neural network model and a set of weights associated with the neural network model. Each of the set of gradients having a value less than a threshold is clipped, resulting in non-clipped data elements and clipped data elements. A mapping indicating which of the set of gradients correspond to non-clipped data elements and which of the set of gradients correspond to clipped data elements is generated. Compressed data is generated based on the non-clipped data elements. The mapping and the compressed data are transmitted from the first worker node to a second worker node of the distributed system.
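The clip-map-compress round trip can be sketched as follows; the magnitude comparison and zero-reconstruction on the receiving side are assumptions about details the abstract leaves open.

```python
def compress_gradients(grads, threshold):
    # Clip gradients whose magnitude falls below the threshold, keeping a
    # boolean mapping of which positions survived so the receiver can
    # scatter the compressed values back into place.
    mapping = [abs(g) >= threshold for g in grads]
    compressed = [g for g, keep in zip(grads, mapping) if keep]
    return mapping, compressed

def decompress_gradients(mapping, compressed):
    # Clipped positions are reconstructed as zeros at the second worker.
    it = iter(compressed)
    return [next(it) if keep else 0.0 for keep in mapping]
```

Only the mapping (one bit per gradient) and the surviving values cross the network, which is the source of the bandwidth saving when most gradients are small.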

    TRANSPOSE OPERATIONS USING PROCESSING ELEMENT ARRAY

    Publication No.: US20210096823A1

    Publication Date: 2021-04-01

    Application No.: US17122136

    Application Date: 2020-12-15

Abstract: Provided are integrated circuits and methods for transposing a tensor using processing element array operations. In some cases, it may be necessary to transpose elements of a tensor to perform a matrix operation. The tensor may be decomposed into blocks of data elements having dimensions consistent with the dimensions of a systolic array. An identity multiplication may be performed on each block of data elements loaded into a systolic array and the multiplication products summed in column partitions of a results buffer. The data elements in the column partitions of the results buffer can then be mapped to row partitions of a buffer memory for further processing.

    Secure data processing
    Invention Grant

    Publication No.: US10956584B1

    Publication Date: 2021-03-23

    Application No.: US16141770

    Application Date: 2018-09-25

Abstract: Systems and methods for performing neural network processing are provided. In one example, a system comprises a neural network processor comprising: a data decryption engine that receives encrypted data and decrypts the encrypted data, the encrypted data comprising at least one of: encrypted weights data, encrypted input data, or encrypted instruction data related to a neural network model; and a computing engine that receives the weights data and performs computations of the neural network processing using the input data and the weights data, based on the instruction data.
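A toy sketch of the decrypt-then-compute flow. The repeating-XOR keystream below is a deliberately simple stand-in for the data decryption engine (a real engine would use a proper cipher such as AES), and the dot product stands in for the computing engine; all names are illustrative.

```python
def xor_keystream_decrypt(ciphertext, key):
    # NOT real cryptography: a repeating-XOR keystream standing in for
    # the hardware decryption engine.
    return bytes(c ^ key[i % len(key)] for i, c in enumerate(ciphertext))

def secure_inference(encrypted_weights, encrypted_inputs, key):
    # Weights and inputs are decrypted inside the processor boundary;
    # plaintext never leaves it.
    weights = list(xor_keystream_decrypt(encrypted_weights, key))
    inputs = list(xor_keystream_decrypt(encrypted_inputs, key))
    # Computing engine: here a dot product stands in for the full
    # neural network computation.
    return sum(w * x for w, x in zip(weights, inputs))
```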

    Assisted indirect memory addressing
    Invention Grant

    Publication No.: US10929063B1

    Publication Date: 2021-02-23

    Application No.: US16368538

    Application Date: 2019-03-28

    Abstract: Systems and methods for assisted indirect memory addressing are provided. Some computing systems move data between levels of a hierarchical memory system. To accommodate data movement for computing systems that do not natively support indirect addressing between levels of the memory hierarchy, a direct memory access (DMA) engine is used to fetch data. The DMA engine executes a first set of memory instructions that modify a second set of memory instructions to fetch data stored at one level of the memory hierarchy from dynamically computed indirect addresses stored in memory locations at another level of the memory hierarchy.
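The two-stage descriptor patching can be sketched as follows, with memory modeled as a dictionary; the function name, the flat address space, and the element-size parameter are illustrative assumptions.

```python
def assisted_indirect_fetch(memory, index_table_addr, count, data_base,
                            elem_size=1):
    # First instruction set: read the dynamically computed indices stored
    # at one level of the memory hierarchy.
    indices = [memory[index_table_addr + i] for i in range(count)]
    # These reads "modify" the second instruction set by patching the
    # source addresses it will fetch from.
    patched_descriptors = [data_base + idx * elem_size for idx in indices]
    # Second instruction set: fetch data at the indirect addresses.
    return [memory[addr] for addr in patched_descriptors]
```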

    Power reduction in processor pipeline by detecting zeros

    Publication No.: US10901492B1

    Publication Date: 2021-01-26

    Application No.: US16369696

    Application Date: 2019-03-29

    Abstract: Techniques are described for power reduction in a computer processor based on detection of whether data destined for input to an arithmetic logic unit (ALU) has a particular value. The data is written to a register prior to performing an arithmetic or logical operation using the data as an operand. Depending on a timing of when the data is supplied to the register, the determination is made before or after the data is written to the register, and a memory associated with the register is updated with a result of the determination. Contents of the memory are used to make a decision whether to allow the ALU to perform the arithmetic or logical operation. The memory can be implemented as a non-architectural register.
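A behavioral sketch of the zero-tracking register and the gating decision; the class name and the multiply-only gating policy are illustrative assumptions, and a dictionary stands in for the non-architectural register.

```python
class ZeroTrackingRegisterFile:
    def __init__(self):
        self.regs = {}
        # The non-architectural memory described above: one flag per
        # register, recording whether the last write was zero.
        self.zero_flags = {}

    def write(self, name, value):
        # The zero check happens alongside the register write.
        self.regs[name] = value
        self.zero_flags[name] = (value == 0)

    def mul(self, ra, rb):
        # Consult the flags before clocking the ALU: a multiply with a
        # known-zero operand needs no arithmetic at all.
        if self.zero_flags[ra] or self.zero_flags[rb]:
            return 0
        return self.regs[ra] * self.regs[rb]  # normal ALU path
```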

    Notifications in integrated circuits

    Publication No.: US10896001B1

    Publication Date: 2021-01-19

    Application No.: US16145050

    Application Date: 2018-09-27

Abstract: Provided are integrated circuit devices and methods for operating integrated circuit devices. In various examples, an integrated circuit device can be operable to determine, at a point in time during operation of the integrated circuit device, to generate a notification. The notification can include a type and a timestamp indicating the point in time. The notification can also include information about an internal status of the integrated circuit at the point in time. The device can further select a queue from a plurality of queues in a processor memory of the computing system that includes the integrated circuit. The device can further generate a write transaction including the notification, where the write transaction is addressed to the queue. The device can further output the write transaction using a communication interface of the device.
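The notification path can be sketched as building a payload (type, timestamp, internal status) and addressing a write transaction to a selected queue. The dictionary layout and the `select` callback modeling the queue-selection policy are illustrative assumptions.

```python
import time

def make_notification_write(kind, status, queues, select):
    # Build the notification: a type, a timestamp for the point in time,
    # and the device's internal status at that moment.
    notification = {
        'type': kind,
        'timestamp': time.monotonic_ns(),
        'status': status,
    }
    # Select one of the queues in processor memory; `select` stands in
    # for the device's queue-selection logic.
    queue_addr = queues[select(kind, queues)]
    # The write transaction is addressed to the chosen queue.
    return {'address': queue_addr, 'payload': notification}
```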
