-
Publication number: US11847185B2
Publication date: 2023-12-19
Application number: US17485055
Filing date: 2021-09-24
Applicant: Intel Corporation
Inventor: Dan Baum , Chen Koren , Elmoustapha Ould-Ahmed-Vall , Michael Espig , Christopher J. Hughes , Raanan Sade , Robert Valentine , Mark J. Charney , Alexander F. Heinecke
CPC classification number: G06F17/16 , G06F9/3001 , G06F9/3016 , G06F9/30101 , G06F9/3802
Abstract: Disclosed embodiments relate to accelerating multiplication of sparse matrices. In one example, a processor is to fetch and decode an instruction having fields to specify locations of first, second, and third matrices, and an opcode indicating the processor is to multiply and accumulate matching non-zero (NZ) elements of the first and second matrices with corresponding elements of the third matrix. The processor is to execute the decoded instruction as per the opcode to generate NZ bitmasks for the first and second matrices and to broadcast up to two NZ elements at a time from each row of the first matrix and each column of the second matrix to a processing engine (PE) grid, each PE to multiply and accumulate matching NZ elements of the first and second matrices with corresponding elements of the third matrix. Each PE is further to store an NZ element for use in subsequent multiplications.
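The bitmask matching described in this abstract can be illustrated in software. The following is a minimal sketch, not the patented hardware: the function names `nz_bitmask` and `sparse_matmul_accumulate` are hypothetical, matrices are assumed to be plain lists of lists, and the sketch does not model the two-elements-at-a-time broadcast or the per-PE storage of an NZ element.

```python
def nz_bitmask(values):
    """Return an integer whose bit k is set when values[k] is non-zero."""
    mask = 0
    for k, v in enumerate(values):
        if v != 0:
            mask |= 1 << k
    return mask


def sparse_matmul_accumulate(a, b, c):
    """Compute C += A @ B, multiplying only where both operands are non-zero.

    a is rows x inner, b is inner x cols, c is rows x cols (updated in place).
    """
    rows, inner, cols = len(a), len(b), len(b[0])
    # One NZ bitmask per row of A and per column of B.
    row_masks = [nz_bitmask(a[i]) for i in range(rows)]
    col_masks = [nz_bitmask([b[k][j] for k in range(inner)]) for j in range(cols)]
    for i in range(rows):
        for j in range(cols):
            # A set bit marks an index k where a[i][k] and b[k][j] are both NZ.
            match = row_masks[i] & col_masks[j]
            k = 0
            while match:
                if match & 1:
                    c[i][j] += a[i][k] * b[k][j]
                match >>= 1
                k += 1
    return c
```

For example, with `a = [[1, 0, 2], [0, 0, 3]]`, `b = [[0, 4], [5, 0], [6, 7]]`, and `c` initialized to all ones, only the index pairs where both masks have a set bit contribute a multiplication.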
-
Publication number: US11513893B2
Publication date: 2022-11-29
Application number: US17128414
Filing date: 2020-12-21
Applicant: Intel Corporation
Inventor: Somnath Paul , Charles Augustine , Chen Koren , George Shchupak , Muhammad M. Khellah
Abstract: A system includes a compute circuit that preemptively performs a computation on a data word before receiving an indication of data errors from an error checking and correction (ECC) circuit. The ECC circuit reads the data word from a memory array and performs error detection and error correction on the data word. The compute circuit reads the data word and performs the computation on the data word to generate an output value, without waiting for the ECC circuit to check and correct the data word. In response to error detection in the data word by the ECC circuit, the compute circuit delays outputting the output value until correction of the output value in accordance with the error detection by the ECC circuit.
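The control flow in this abstract can be modeled in a few lines. This is a rough sketch under stated assumptions, not the circuit from the patent: the ECC scheme here is a simple triple-copy bitwise majority vote chosen only for brevity, the names `ecc_check` and `speculative_compute` are hypothetical, and on an error the sketch recomputes from the corrected word rather than patching the output value.

```python
def ecc_check(copies):
    """Toy ECC: bitwise majority vote over three stored copies of a word.

    Returns (error_detected, corrected_word), where error_detected refers
    to the first copy, which is the one the compute path reads.
    """
    a, b, c = copies
    corrected = (a & b) | (a & c) | (b & c)
    return corrected != a, corrected


def speculative_compute(copies, compute):
    """Run compute on the raw word without waiting for the ECC result.

    Returns (output, was_delayed). The speculative output is released
    immediately when ECC reports no error; otherwise it is withheld and
    replaced by a result derived from the corrected word.
    """
    raw = copies[0]
    speculative = compute(raw)            # started before ECC finishes
    error, corrected = ecc_check(copies)  # in hardware this runs in parallel
    if not error:
        return speculative, False
    return compute(corrected), True
```

In the common error-free case the speculative result is used as-is, which is where the latency saving would come from; the slower path is taken only when the check flags the word.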
-
Publication number: US20210109809A1
Publication date: 2021-04-15
Application number: US17128414
Filing date: 2020-12-21
Applicant: Intel Corporation
Inventor: Somnath Paul , Charles Augustine , Chen Koren , George Shchupak , Muhammad M. Khellah
Abstract: A system includes a compute circuit that preemptively performs a computation on a data word before receiving an indication of data errors from an error checking and correction (ECC) circuit. The ECC circuit reads the data word from a memory array and performs error detection and error correction on the data word. The compute circuit reads the data word and performs the computation on the data word to generate an output value, without waiting for the ECC circuit to check and correct the data word. In response to error detection in the data word by the ECC circuit, the compute circuit delays outputting the output value until correction of the output value in accordance with the error detection by the ECC circuit.
-
Publication number: US11450672B2
Publication date: 2022-09-20
Application number: US16859600
Filing date: 2020-04-27
Applicant: Intel Corporation
Inventor: Charles Augustine , Somnath Paul , Muhammad M. Khellah , Chen Koren
IPC: G11C17/16 , H01L27/11 , G11C11/418 , G11C11/419 , G11C11/412
Abstract: An ultra-deep compute Static Random Access Memory (SRAM) with high compute throughput and multi-directional data transfer capability is provided. Compute units are placed in both horizontal and vertical directions to achieve a symmetric layout while enabling communication between the compute units. An SRAM array supports simultaneous read and write to the left and right section of the same SRAM subarray by duplicating pre-decoding logic inside the SRAM array. This allows applications with non-overlapping read and write address spaces to have twice the bandwidth as compared to a baseline SRAM array.
-
Publication number: US10929503B2
Publication date: 2021-02-23
Application number: US16230814
Filing date: 2018-12-21
Applicant: Intel Corporation
Inventor: Omid Azizi , Chen Koren , Nitin Garegrat
Abstract: An apparatus and method for a masked multiply instruction to support neural network pruning operations. For example, one embodiment of a processor comprises: a decoder to decode a matrix multiplication with masking (GEMM) instruction identifying a destination matrix register to store a result, and source registers storing an A-matrix, a B-matrix, and a matrix mask; execution circuitry to execute the GEMM instruction, the execution circuitry to multiply a plurality of B-matrix elements with a plurality of A-matrix elements, each of the B-matrix elements associated with a mask value in the matrix mask, wherein if the mask value is set to a first value, then the execution circuitry is to multiply the B-matrix element with one or more of the A-matrix elements to generate a first partial result, and if the mask value is set to a second value, then the execution circuitry is to multiply an alternate B-matrix element with one or more of the A-matrix elements to generate a second partial result.
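The mask-driven selection in this abstract can be sketched as follows. The abstract does not say where the alternate B-matrix element comes from, so this sketch assumes a hypothetical second matrix `b_alt` of the same shape as `b` (all zeros would correspond to plain pruning). It also assumes the "first value" of the mask is 1. The function name `masked_gemm` is illustrative only.

```python
def masked_gemm(a, b, mask, b_alt):
    """Return A @ B', where B'[p][j] is b[p][j] if mask[p][j] == 1, else b_alt[p][j].

    a is m x k; b, mask, and b_alt are k x n. All are lists of lists.
    """
    m, k, n = len(a), len(b), len(b[0])
    out = [[0] * n for _ in range(m)]
    for p in range(k):
        for j in range(n):
            # The mask value picks the B element or its alternate.
            elem = b[p][j] if mask[p][j] == 1 else b_alt[p][j]
            # Each selected element is multiplied with a column of A elements.
            for i in range(m):
                out[i][j] += a[i][p] * elem
    return out
```

With `b_alt` set to all zeros, masked-out weights contribute nothing to the result, which matches the pruning use case the abstract mentions.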
-
Publication number: US10509846B2
Publication date: 2019-12-17
Application number: US15840552
Filing date: 2017-12-13
Applicant: Intel Corporation
Inventor: Chen Koren , Dan Baum
Abstract: An accelerator for increasing the processing speed of a processor. The accelerator operates in two distinct modes. In a first mode for dense layer processing, row data sets and column data sets are sent to a multiplier for multiplication. In a second mode for sparse layer processing, compressed row data sets are received by a row multiplexer and compressed column data sets are received by a column multiplexer. Each multiplexer is configured to compare the indexes of data sets with one another to determine matching indexes. When indexes match, the matching data sets are selected and sent to the multiplier for multiplication. When indexes do not match, data sets are stored in memory devices for subsequent cycles.
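The index-matching step of the sparse mode resembles a two-pointer merge over compressed vectors. The sketch below is a software analogy under stated assumptions, not the multiplexer hardware: each compressed data set is assumed to be a list of `(index, value)` pairs sorted by index, and the name `sparse_dot` is hypothetical.

```python
def sparse_dot(row, col):
    """Dot product of two compressed vectors given as sorted (index, value) pairs."""
    i = j = 0
    total = 0
    while i < len(row) and j < len(col):
        row_index, row_value = row[i]
        col_index, col_value = col[j]
        if row_index == col_index:
            # Matching indexes: both values go to the multiplier.
            total += row_value * col_value
            i += 1
            j += 1
        elif row_index < col_index:
            # No match: the column entry is held for a later step.
            i += 1
        else:
            # No match: the row entry is held for a later step.
            j += 1
    return total
```

Holding the entry with the larger index while the other side advances plays the role of the abstract's "stored in memory devices for subsequent cycles".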
-
Publication number: US20190042538A1
Publication date: 2019-02-07
Application number: US15840552
Filing date: 2017-12-13
Applicant: Intel Corporation
Inventor: Chen Koren , Dan Baum
Abstract: An accelerator for increasing the processing speed of a processor. The accelerator operates in two distinct modes. In a first mode for dense layer processing, row data sets and column data sets are sent to a multiplier for multiplication. In a second mode for sparse layer processing, compressed row data sets are received by a row multiplexer and compressed column data sets are received by a column multiplexer. Each multiplexer is configured to compare the indexes of data sets with one another to determine matching indexes. When indexes match, the matching data sets are selected and sent to the multiplier for multiplication. When indexes do not match, data sets are stored in memory devices for subsequent cycles.
-
Publication number: US12287843B2
Publication date: 2025-04-29
Application number: US18502291
Filing date: 2023-11-06
Applicant: Intel Corporation
Inventor: Dan Baum , Chen Koren , Elmoustapha Ould-Ahmed-Vall , Michael Espig , Christopher J. Hughes , Raanan Sade , Robert Valentine , Mark J. Charney , Alexander F. Heinecke
Abstract: Disclosed embodiments relate to accelerating multiplication of sparse matrices. In one example, a processor is to fetch and decode an instruction having fields to specify locations of first, second, and third matrices, and an opcode indicating the processor is to multiply and accumulate matching non-zero (NZ) elements of the first and second matrices with corresponding elements of the third matrix. The processor is to execute the decoded instruction as per the opcode to generate NZ bitmasks for the first and second matrices and to broadcast up to two NZ elements at a time from each row of the first matrix and each column of the second matrix to a processing engine (PE) grid, each PE to multiply and accumulate matching NZ elements of the first and second matrices with corresponding elements of the third matrix. Each PE is further to store an NZ element for use in subsequent multiplications.