Cache architecture for a massively parallel processing array

    Publication No.: US12086066B1

    Publication Date: 2024-09-10

    Application No.: US18184536

    Filing Date: 2023-03-15

    Applicant: Cornami, Inc.

    IPC Classification: G06F12/0813

    CPC Classification: G06F12/0813 G06F2212/1041

    Abstract: A cache architecture for an array of identical cores arranged in a grid. Each core includes interconnections to its neighboring cores in the grid, a memory, and an arithmetic logic unit. A first core of the array is configured to receive a memory access request for data from at least one core of the array that is configured to perform a computational operation. A second core of the array is configured to determine whether the requested data is present in a cache memory, via a cache index of addresses in the cache memory. A third core of the array is configured as the cache memory: the memory of the third core serves as cache storage. The address of the requested data from the cache index is passed to the third core, which outputs the requested data.
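The request flow in the abstract splits the cache among three core roles: one accepts requests, one holds the index, and one's local memory is the cache itself. A minimal sketch of that flow, with all class and method names invented for illustration (the patent does not publish an API):

```python
class Core:
    """One core in the grid, with a small local memory."""
    def __init__(self, name):
        self.name = name
        self.memory = {}

class CacheFabric:
    """Three cores cooperating as a cache, per the roles in the abstract."""
    def __init__(self):
        self.receiver = Core("receiver")   # first core: accepts memory requests
        self.indexer = Core("indexer")     # second core: holds the cache index
        self.storage = Core("storage")     # third core: its memory is the cache
        self.indexer.memory["index"] = {}  # maps data key -> address in storage

    def fill(self, key, slot, value):
        """Populate the cache: record the slot in the index, store the value."""
        self.indexer.memory["index"][key] = slot
        self.storage.memory[slot] = value

    def request(self, key):
        """Look up the key in the index; on a hit, read from the storage core."""
        slot = self.indexer.memory["index"].get(key)
        if slot is None:
            return None                    # cache miss
        return self.storage.memory[slot]   # address passed to the third core

fabric = CacheFabric()
fabric.fill("weights[0]", slot=0x40, value=3.14)
print(fabric.request("weights[0]"))   # hit
print(fabric.request("weights[1]"))   # miss
```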

    Method and system for compressing application data for operations on multi-core systems

    Publication No.: US11599367B2

    Publication Date: 2023-03-07

    Application No.: US16752239

    Filing Date: 2020-01-24

    Applicant: CORNAMI, INC.

    Inventor: Tianfang Liu

    Abstract: A system and method to compress application control data, such as the weights for a layer of a convolutional neural network, is disclosed. A multi-core system for executing at least one layer of the convolutional neural network includes a storage device storing a decompression matrix and a compressed weight matrix of the set of weights of the at least one layer. The compressed weight matrix is formed by matrix factorization and quantization of each weight's floating-point value to a reduced floating-point format. A decompression module is operable to obtain an approximation of the weight values by decompressing the compressed weight matrix through the decompression matrix. A plurality of cores executes the at least one layer of the convolutional neural network with the approximated weight values to produce an inference output.
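The compression scheme the abstract outlines (factor the weight matrix into two smaller matrices, quantize to a narrower floating-point format, reconstruct an approximation at inference time) can be sketched as below. Truncated SVD is one possible factorization, and the rank and dtype are illustrative choices, not values from the patent:

```python
# Sketch: compress a layer's weights via matrix factorization + quantization.
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 64)).astype(np.float32)   # layer weights

# Factorize: W ~ compressed @ decompressor, keeping the top-k directions.
k = 8
U, s, Vt = np.linalg.svd(W, full_matrices=False)
compressed = (U[:, :k] * s[:k]).astype(np.float16)      # "compressed weight matrix"
decompressor = Vt[:k, :].astype(np.float16)             # "decompression matrix"

# Decompress: multiply the two stored matrices to approximate W.
W_approx = compressed.astype(np.float32) @ decompressor.astype(np.float32)

stored = compressed.size + decompressor.size
print(f"stored values: {stored} vs original {W.size}")  # 1024 vs 4096
```

Only the two factor matrices are stored, so the layer executes with a 4x smaller footprint at the cost of an approximation error that grows as the rank k shrinks.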

    Parallel Processing of Data Having Data Dependencies for Accelerating the Launch and Performance of Operating Systems and Other Computing Applications

    Publication No.: US20220058199A1

    Publication Date: 2022-02-24

    Application No.: US17467231

    Filing Date: 2021-09-05

    Applicant: Cornami, Inc.

    Abstract: Representative embodiments are disclosed for a rapid and highly parallel decompression of compressed executable and other files, such as executable files for operating systems and applications, having compressed blocks including run length encoded ("RLE") data with data-dependent references. An exemplary embodiment includes a plurality of processors or processor cores to identify a start or end of each compressed block; to partially decompress, in parallel, a selected compressed block into independent data, dependent (RLE) data, and linked dependent (RLE) data; to sequence the independent data, dependent (RLE) data, and linked dependent (RLE) data from a plurality of partial decompressions of a plurality of compressed blocks; to obtain the data specified by the dependent (RLE) data and linked dependent (RLE) data; and to insert the obtained data into the corresponding location in an uncompressed file. The representative embodiments are also applicable to other types of data processing for applications having data dependencies.
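The two-phase idea in the abstract can be sketched as follows: each compressed block is partially decompressed in parallel (independent literals expanded, back-references left as placeholders), and the references are resolved once the blocks are sequenced. The token format here is invented for illustration:

```python
# Sketch: parallel partial decompression, then sequential reference resolution.
from concurrent.futures import ThreadPoolExecutor

# A block is a list of tokens: ("lit", bytes) or ("ref", distance, length).
blocks = [
    [("lit", b"abcabc")],
    [("ref", 6, 6), ("lit", b"XY")],      # depends on data in an earlier block
    [("ref", 2, 2), ("lit", b"!")],
]

def partial_decompress(block):
    """Phase 1: expand literals; keep (distance, length) refs unresolved."""
    out = []
    for tok in block:
        if tok[0] == "lit":
            out.append(tok[1])            # independent data
        else:
            out.append(("ref", tok[1], tok[2]))   # dependent (RLE) data
    return out

with ThreadPoolExecutor() as pool:        # blocks decompress in parallel
    partial = list(pool.map(partial_decompress, blocks))

# Phase 2: sequence the partial results and resolve the dependent refs.
result = bytearray()
for part in partial:
    for item in part:
        if isinstance(item, bytes):
            result += item
        else:
            _, dist, length = item
            start = len(result) - dist
            for i in range(length):       # byte-by-byte: refs may overlap
                result.append(result[start + i])

print(result.decode())
```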

    METHOD AND SYSTEM FOR ROBUST STREAMING OF DATA

    Publication No.: US20210297361A1

    Publication Date: 2021-09-23

    Application No.: US16825585

    Filing Date: 2020-03-20

    Applicant: Cornami, Inc.

    Abstract: A method and system for providing robust streaming of data from a multi-core die is disclosed. The techniques include using a high-bandwidth memory (HBM) device as a retransmit buffer for large amounts of data, ensuring robust communication over links with relatively high round-trip transmission time (RTT). Another technique supports two or more Ethernet ports between components and transmits the same data packets on both ports for robustness. Another technique assigns sequence numbers, sends data packets from the different ports in round-robin fashion, and reorders the packets on receipt at an external device. Another technique dynamically adds and removes paths for data packets between devices with multiple ports based on the quality of each path.
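The sequencing technique described above (round-robin transmission across ports, reordering by sequence number at the receiver) can be sketched in a few lines. Loss handling and retransmit buffering are omitted, and the port names are illustrative:

```python
# Sketch: round-robin send over two ports, reorder by sequence number.
from itertools import cycle

payloads = [b"p0", b"p1", b"p2", b"p3", b"p4"]
ports = cycle(["eth0", "eth1"])

# Sender: tag each payload with a sequence number, alternate ports.
wire = [(seq, next(ports), data) for seq, data in enumerate(payloads)]

# Network: packets from different ports may arrive out of order
# (simulated here by grouping arrivals per port).
arrived = sorted(wire, key=lambda pkt: (pkt[1], pkt[0]))

# Receiver: reorder by sequence number regardless of arrival port.
in_order = [data for seq, port, data in sorted(arrived)]
print(in_order)
```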

    METHOD AND SYSTEM FOR COMPRESSING APPLICATION DATA FOR OPERATIONS ON MULTI-CORE SYSTEMS

    Publication No.: US20210232407A1

    Publication Date: 2021-07-29

    Application No.: US16752239

    Filing Date: 2020-01-24

    Applicant: CORNAMI, INC.

    Inventor: Tianfang LIU

    Abstract: A system and method to compress application control data, such as the weights for a layer of a convolutional neural network, is disclosed. A multi-core system for executing at least one layer of the convolutional neural network includes a storage device storing a decompression matrix and a compressed weight matrix of the set of weights of the at least one layer. The compressed weight matrix is formed by matrix factorization and quantization of each weight's floating-point value to a reduced floating-point format. A decompression module is operable to obtain an approximation of the weight values by decompressing the compressed weight matrix through the decompression matrix. A plurality of cores executes the at least one layer of the convolutional neural network with the approximated weight values to produce an inference output.

    METHOD AND SYSTEM FOR PROVIDING FAULT TOLERANT LAYOUT OF MASSIVELY PARALLEL PROCESSING ARRAY

    Publication No.: US20240160825A1

    Publication Date: 2024-05-16

    Application No.: US18054460

    Filing Date: 2022-11-10

    Applicant: Cornami, Inc.

    IPC Classification: G06F30/392

    CPC Classification: G06F30/392

    Abstract: A system and method to create a robust topology for a layout of cores performing a function on an array of cores arranged in a grid is disclosed. A defective-core file giving the locations of defective cores in the array, and an optimal initial topology for a configuration layout of at least some of the cores, are input. The location of at least one defective core in the array is determined. At least some of the cores in the array are assigned to the optimal initial topology. It is then determined whether at least one defective core falls within the optimal initial topology. If so, the functions of the cores in the row and the column of the defective core are assigned to neighboring cores in the array to create the robust topology.
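The remapping step above can be sketched as: place the ideal layout on the grid and, whenever a cell lands on a defective core, hand its function to a neighboring healthy core. The grid size, function names, and neighbor search order are all illustrative choices, not details from the patent:

```python
# Sketch: remap an ideal core layout around defective cores.

GRID = 4                                  # 4x4 array of cores
defective = {(1, 1)}                      # from the "defective core file"
ideal = {(0, 0): "load", (1, 1): "mul", (2, 2): "acc"}   # ideal topology

def neighbors(r, c):
    """Row/column neighbors of a core, in a fixed search order."""
    for dr, dc in ((0, 1), (1, 0), (0, -1), (-1, 0)):
        nr, nc = r + dr, c + dc
        if 0 <= nr < GRID and 0 <= nc < GRID:
            yield nr, nc

robust = {}
for (r, c), fn in ideal.items():
    if (r, c) not in defective:
        robust[(r, c)] = fn               # healthy core keeps its function
        continue
    # Reassign the defective core's function to the first free healthy neighbor.
    for nr, nc in neighbors(r, c):
        if (nr, nc) not in defective and (nr, nc) not in robust:
            robust[(nr, nc)] = fn
            break

print(robust)   # "mul" has moved off the defective core at (1, 1)
```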

    RECONFIGURABLE REDUCED INSTRUCTION SET COMPUTER PROCESSOR ARCHITECTURE WITH FRACTURED CORES

    Publication No.: US20220179823A1

    Publication Date: 2022-06-09

    Application No.: US17681163

    Filing Date: 2022-02-25

    Applicant: Cornami Inc.

    IPC Classification: G06F15/78 G06F9/30 G06F15/80

    Abstract: Systems and methods for reconfiguring a reduced instruction set computer processor architecture are disclosed. Exemplary implementations may: provide a primary processing core consisting of a RISC processor; provide a node wrapper associated with each of a plurality of secondary cores, the node wrapper comprising a cache memory associated with each secondary core and a load/unload matrix associated with each secondary core; and operate the architecture in a manner in which, for at least one core, data is read from and written to the cache memory in a control-centric mode, while the secondary cores are selectively partitioned to operate in a streaming mode in which data streams out of the corresponding secondary core into the main memory and other ones of the plurality of secondary cores.
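The selective partitioning the abstract describes (each secondary core wrapped with per-core state, switchable between a control-centric mode with random access to its cache and a streaming mode where data flows through to a sink) can be sketched as below. All class, mode, and kernel names are invented for illustration:

```python
# Sketch: per-core node wrappers partitioned between two operating modes.

class NodeWrapper:
    """Wraps one secondary core with its cache and its current mode."""
    def __init__(self, core_id):
        self.core_id = core_id
        self.cache = {}
        self.mode = "control"             # "control" or "streaming"

    def write(self, addr, value):
        """Control-centric mode: random-access writes into the core's cache."""
        assert self.mode == "control", "random access only in control mode"
        self.cache[addr] = value

    def stream(self, data, sink):
        """Streaming mode: data flows through the core's kernel into a sink."""
        assert self.mode == "streaming"
        sink.extend(x * 2 for x in data)  # stand-in for the core's kernel

cores = [NodeWrapper(i) for i in range(4)]
for c in cores[2:]:                       # partition: last two cores stream
    c.mode = "streaming"

cores[0].write(0x10, 42)                  # control-centric cache access
main_memory = []
cores[2].stream([1, 2, 3], main_memory)   # streamed out to "main memory"
print(cores[0].cache, main_memory)
```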

    Parallel processing of data having data dependencies for accelerating the launch and performance of operating systems and other computing applications

    Publication No.: US11151139B2

    Publication Date: 2021-10-19

    Application No.: US16900381

    Filing Date: 2020-06-12

    Applicant: Cornami, Inc.

    Abstract: Representative embodiments are disclosed for a rapid and highly parallel decompression of compressed executable and other files, such as executable files for operating systems and applications, having compressed blocks including run length encoded ("RLE") data with data-dependent references. An exemplary embodiment includes a plurality of processors or processor cores to identify a start or end of each compressed block; to partially decompress, in parallel, a selected compressed block into independent data, dependent (RLE) data, and linked dependent (RLE) data; to sequence the independent data, dependent (RLE) data, and linked dependent (RLE) data from a plurality of partial decompressions of a plurality of compressed blocks; to obtain the data specified by the dependent (RLE) data and linked dependent (RLE) data; and to insert the obtained data into the corresponding location in an uncompressed file. The representative embodiments are also applicable to other types of data processing for applications having data dependencies.