-
Publication No.: US12141683B2
Publication Date: 2024-11-12
Application No.: US17246341
Application Date: 2021-04-30
Applicant: Intel Corporation
Inventor: Arnab Raha , Debabrata Mohapatra , Gautham Chinya , Guruguhanathan Venkataramanan , Sang Kyun Kim , Deepak Mathaikutty , Raymond Sung , Cormac Brick
Abstract: Embodiments of the present disclosure are directed toward techniques and configurations for enhancing the performance of hardware (HW) accelerators. Disclosed embodiments include a static MAC scaling arrangement, which includes architectures and techniques for scaling the performance per unit of power and the performance per area of HW accelerators. Disclosed embodiments also include a dynamic MAC scaling arrangement, which includes architectures and techniques for dynamically scaling the number of active multiply-and-accumulate (MAC) units within an HW accelerator based on activation and weight sparsity. Other embodiments may be described and/or claimed.
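A minimal Python sketch of the dynamic MAC scaling idea this abstract describes: the number of MAC units that must stay active is derived from activation and weight sparsity bitmaps. The function name, bitmap format, and gating policy are illustrative assumptions, not taken from the patent.

```python
def active_mac_count(act_bitmap, wt_bitmap, total_macs):
    """Estimate how many MAC units need to stay active for one compute
    step, given 1-bit sparsity maps of the activations and weights.

    A product contributes to the accumulation only when both the
    activation and the weight are nonzero, so the remaining MAC units
    can be clock- or power-gated.
    """
    useful = sum(a & w for a, w in zip(act_bitmap, wt_bitmap))
    return min(useful, total_macs)

# Example: an 8-wide MAC array where half the products are zero-skippable.
acts    = [1, 0, 1, 1, 0, 1, 0, 1]
weights = [1, 1, 0, 1, 0, 1, 1, 1]
print(active_mac_count(acts, weights, total_macs=8))  # -> 4
```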
-
Publication No.: US20220188638A1
Publication Date: 2022-06-16
Application No.: US17684764
Application Date: 2022-03-02
Applicant: Intel Corporation
Inventor: Deepak Abraham Mathaikutty , Arnab Raha , Raymond Jit-Hung Sung , Debabrata Mohapatra
Abstract: An apparatus for convolution operations is provided. The apparatus includes a processing element (PE) array, a datastore, writing modules, reading modules, and a controlling module. The PE array performs multiply-accumulate (MAC) operations. The datastore includes databanks, each of which stores data to be used by a column of the PE array. The writing modules transfer data from a memory to the datastore. The reading modules transfer data from the datastore to the PE array. Each reading module may transfer data to a particular column of the PE array. The controlling module can determine the rounds of a convolution operation. Each round includes MAC operations based on a weight. The controlling module controls the writing modules and reading modules so that the same data in a databank can be reused in multiple rounds. For different rounds, the controlling module can provide a reading module access to different databanks.
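A hedged sketch of the round-to-databank mapping the controlling module could apply; the rotating assignment and all names here are assumptions for illustration, since the abstract only states that reading modules are given access to different databanks in different rounds.

```python
def plan_reads(num_rounds, num_banks, num_readers):
    """For each round of a convolution, assign every reading module
    (one per PE-array column) a databank. Rotating the mapping lets
    data already resident in a bank be reused across rounds without
    rewriting it from memory."""
    plan = []
    for r in range(num_rounds):
        plan.append({reader: (reader + r) % num_banks
                     for reader in range(num_readers)})
    return plan

for r, mapping in enumerate(plan_reads(num_rounds=3, num_banks=4, num_readers=4)):
    print(f"round {r}: reader->bank {mapping}")
```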
-
Publication No.: US20250036928A1
Publication Date: 2025-01-30
Application No.: US18907748
Application Date: 2024-10-07
Applicant: Intel Corporation
Inventor: Arnab Raha , Debabrata Mohapatra , Gautham Chinya , Guruguhanathan Venkataramanan , Sang Kyun Kim , Deepak Mathaikutty , Raymond Sung , Cormac Brick
Abstract: Embodiments of the present disclosure are directed toward techniques and configurations for enhancing the performance of hardware (HW) accelerators. Disclosed embodiments include a static MAC scaling arrangement, which includes architectures and techniques for scaling the performance per unit of power and the performance per area of HW accelerators. Disclosed embodiments also include a dynamic MAC scaling arrangement, which includes architectures and techniques for dynamically scaling the number of active multiply-and-accumulate (MAC) units within an HW accelerator based on activation and weight sparsity. Other embodiments may be described and/or claimed.
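This record shares its abstract with US12141683B2 above. As a complementary sketch, here is the per-lane view of the same idea: a MAC lane performs work only when both its activation and weight are nonzero, modeling gated-off lanes. Names and structure are illustrative assumptions, not the claimed circuit.

```python
def sparse_mac(acts, weights, acc=0):
    """Accumulate only the products whose operands are both nonzero;
    the skipped lanes model MAC units that are gated off for this step."""
    for a, w in zip(acts, weights):
        if a != 0 and w != 0:      # enable bit = act_bit & weight_bit
            acc += a * w
    return acc

print(sparse_mac([3, 0, 2, 5], [4, 7, 0, 1]))  # -> 3*4 + 5*1 = 17
```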
-
Publication No.: US20250028565A1
Publication Date: 2025-01-23
Application No.: US18906648
Application Date: 2024-10-04
Applicant: Intel Corporation
Inventor: Debabrata Mohapatra , Arnab Raha , Deepak Mathaikutty , Raymond Sung , Cormac Brick
Abstract: Embodiments of the present disclosure are directed toward techniques and configurations for enhancing the performance of hardware (HW) accelerators. The present disclosure provides a schedule-aware, dynamically reconfigurable, tree-based partial sum accumulator architecture for HW accelerators, wherein the depth of an adder tree in the HW accelerator is dynamically adjusted based on a dataflow schedule generated by a compiler. The adder tree depth is adjusted on a per-layer basis at runtime. Configuration registers, programmed via software, dynamically alter the adder tree depth for partial sum accumulation based on the dataflow schedule. By facilitating a variable-depth adder tree at runtime, the compiler can choose a compute-optimal dataflow schedule that minimizes the number of compute cycles needed to accumulate partial sums across multiple processing elements (PEs) within a PE array of a HW accelerator. Other embodiments may be described and/or claimed.
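A small Python model of the variable-depth adder tree: `depth` stands in for the software-programmed configuration register, and the reduction is a plain pairwise tree. This is a sketch of the concept, not the claimed circuit.

```python
def adder_tree(partials, depth):
    """Reduce PE partial sums through `depth` levels of pairwise adds.
    A depth of d combines up to 2**d partial sums into one result; the
    compiler picks d per layer so no cycles are spent on unused levels."""
    assert len(partials) <= 2 ** depth
    level = list(partials)
    for _ in range(depth):
        level = [level[i] + (level[i + 1] if i + 1 < len(level) else 0)
                 for i in range(0, len(level), 2)]
    return level[0]

print(adder_tree([1, 2, 3, 4], depth=2))   # -> 10
print(adder_tree([5, 7], depth=1))         # -> 12 (shallower tree, fewer cycles)
```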
-
Publication No.: US20240231839A1
Publication Date: 2024-07-11
Application No.: US18416303
Application Date: 2024-01-18
Applicant: Intel Corporation
Inventor: Arnab Raha , Deepak Mathaikutty , Debabrata Mohapatra , Sang Kyun Kim , Gautham Chinya , Cormac Brick
CPC classification number: G06F9/445 , G06F9/3001 , G06F9/5027 , G06N20/00 , H03K19/177 , H03K19/20
Abstract: Methods, apparatus, systems, and articles of manufacture to load data into an accelerator are disclosed. An example apparatus includes data provider circuitry to load a first section and an additional amount of compressed machine learning parameter data into a processor engine. Processor engine circuitry executes a machine learning operation using the first section of compressed machine learning parameter data. Compressed local data re-user circuitry determines whether a second section is present in the additional amount of compressed machine learning parameter data. The processor engine circuitry executes a machine learning operation using the second section when the second section is present in the additional amount of compressed machine learning parameter data.
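A runnable sketch of the load-and-reuse flow the abstract outlines; `run_with_prefetch`, `execute`, and the prefetch margin are hypothetical stand-ins for the data provider circuitry, processor engine circuitry, and re-user circuitry.

```python
def run_with_prefetch(stream, section_size, extra):
    """Load the first section plus an additional amount of compressed
    parameter data, run on the first section, then check whether a full
    second section is already resident before fetching anything else."""
    buffer = stream[:section_size + extra]          # data provider load
    results = [execute(buffer[:section_size])]      # first ML operation
    if len(buffer) >= 2 * section_size:             # re-user check
        results.append(execute(buffer[section_size:2 * section_size]))
    return results

def execute(section):
    # Placeholder for a machine-learning operation on one section.
    return sum(section)

print(run_with_prefetch(list(range(16)), section_size=6, extra=6))  # -> [15, 51]
```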
-
Publication No.: US20210117197A1
Publication Date: 2021-04-22
Application No.: US17132895
Application Date: 2020-12-23
Applicant: Intel Corporation
Inventor: Steven Hsu , Amit Agarwal , Debabrata Mohapatra , Arnab Raha , Moongon Jung , Gautham Chinya , Ram Krishnamurthy
Abstract: Systems, apparatuses and methods identify a plurality of registers that are associated with a system-on-chip. The plurality of registers includes a first portion dedicated to write operations and a second portion dedicated to read operations. The technology writes data to the first portion of the plurality of registers, and transfers the data from the first portion to the second portion.
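A minimal software model of registers split into a write-dedicated portion and a read-dedicated portion, with a transfer step copying one into the other. The class and method names are assumptions; real hardware would implement this with shadow registers rather than Python lists.

```python
class SplitRegisters:
    """Registers split into a write-only portion and a read-only
    portion; `transfer` copies staged writes into the readable portion,
    so readers never observe half-updated state."""
    def __init__(self, n):
        self.write_regs = [0] * n   # first portion: write operations only
        self.read_regs  = [0] * n   # second portion: read operations only

    def write(self, idx, value):
        self.write_regs[idx] = value

    def transfer(self):
        self.read_regs = list(self.write_regs)

    def read(self, idx):
        return self.read_regs[idx]

regs = SplitRegisters(4)
regs.write(2, 0xBEEF)
regs.transfer()
print(hex(regs.read(2)))  # -> 0xbeef
```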
-
Publication No.: US12288153B2
Publication Date: 2025-04-29
Application No.: US18408716
Application Date: 2024-01-10
Applicant: Intel Corporation
Inventor: Gautham Chinya , Huichu Liu , Arnab Raha , Debabrata Mohapatra , Cormac Brick , Lance Hacking
Abstract: Methods and systems include a neural network system that includes a neural network accelerator. The neural network accelerator includes multiple processing engines coupled together to perform arithmetic operations in support of an inference performed using the neural network system. The neural network accelerator also includes schedule-aware tensor data distribution circuitry or software that is configured to load tensor data into the multiple processing engines in a load phase, extract output data from the multiple processing engines in an extraction phase, reorganize the extracted output data, and store the reorganized extracted output data to memory.
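A compact sketch of the load / extract / reorganize / store phases around an array of processing engines; the schedule format and the reorganization policy (tile order) are assumptions for illustration.

```python
def distribute_and_collect(tensor_tiles, schedule, pes):
    # Load phase: hand each PE the tile the dataflow schedule assigns it.
    for pe_id, tile_id in schedule.items():
        pes[pe_id] = tensor_tiles[tile_id]
    # Extraction phase: pull each PE's output (a toy sum stands in for
    # the arithmetic the engines would perform).
    outputs = {pe_id: sum(tile) for pe_id, tile in pes.items()}
    # Reorganize so results land in memory in tile order, then "store".
    return [outputs[pe_id]
            for pe_id, _ in sorted(schedule.items(), key=lambda kv: kv[1])]

tiles = {0: [1, 2], 1: [3, 4], 2: [5, 6]}
print(distribute_and_collect(tiles, schedule={0: 2, 1: 0, 2: 1}, pes={}))  # -> [3, 7, 11]
```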
-
Publication No.: US12242861B2
Publication Date: 2025-03-04
Application No.: US18416303
Application Date: 2024-01-18
Applicant: Intel Corporation
Inventor: Arnab Raha , Deepak Mathaikutty , Debabrata Mohapatra , Sang Kyun Kim , Gautham Chinya , Cormac Brick
Abstract: Methods, apparatus, systems, and articles of manufacture to load data into an accelerator are disclosed. An example apparatus includes data provider circuitry to load a first section and an additional amount of compressed machine learning parameter data into a processor engine. Processor engine circuitry executes a machine learning operation using the first section of compressed machine learning parameter data. Compressed local data re-user circuitry determines whether a second section is present in the additional amount of compressed machine learning parameter data. The processor engine circuitry executes a machine learning operation using the second section when the second section is present in the additional amount of compressed machine learning parameter data.
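This record shares its abstract with the US20240231839A1 entry above. As a complementary sketch, here is just the re-user's decision in isolation: reuse the next section when it is already resident in the prefetched local data, otherwise signal that a fresh memory load is needed. All names are hypothetical.

```python
def next_section(buffer, offset, section_size):
    """Return the next section from local data when it is fully present,
    or None to signal that a load from memory is required."""
    end = offset + section_size
    if end <= len(buffer):        # second section already local: reuse it
        return buffer[offset:end]
    return None                   # fall back to loading from memory

buf = list(range(10))
print(next_section(buf, offset=4, section_size=4))  # -> [4, 5, 6, 7]
print(next_section(buf, offset=8, section_size=4))  # -> None
```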
-
Publication No.: US20220261623A1
Publication Date: 2022-08-18
Application No.: US17733692
Application Date: 2022-04-29
Applicant: Intel Corporation
Inventor: Raymond Jit-Hung Sung , Debabrata Mohapatra , Arnab Raha , Deepak Abraham Mathaikutty , Praveen Kumar Gupta
Abstract: A DNN accelerator includes a column of PEs and an external adder assembly for performing depthwise convolution. Each PE includes register files, multipliers, and an internal adder assembly. Each register file can store an operand (input operand, weight operand, etc.) of the depthwise convolution. The operand includes a sequence of elements, each of which corresponds to a different depthwise channel. A multiplier can perform a sequence of multiplications on two operands, e.g., an input operand and a weight operand, and generate a product operand. The internal adder assembly can accumulate product operands and generate an output operand of the PE. The output operand includes output elements, each of which corresponds to a different depthwise channel. The operands may be reused in different rounds of operations by the multipliers. The external adder assembly can accumulate output operands of multiple PEs and generate an output operand of the PE column.
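A minimal model of the two accumulation stages: each PE's multipliers produce channel-wise product operands that its internal adder assembly accumulates, and an external adder assembly then sums the output operands of the PE column. Register files, operand reuse across rounds, and pipelining are omitted; the function names are illustrative.

```python
def pe(operand_pairs):
    """One PE: each (input, weight) pair yields a product operand, and
    the internal adder assembly accumulates them channel-wise into the
    PE's output operand. Channels are never summed together, which is
    what distinguishes depthwise from standard convolution."""
    out = None
    for inp, wt in operand_pairs:
        prod = [a * w for a, w in zip(inp, wt)]                 # one multiplier pass
        out = prod if out is None else [o + p for o, p in zip(out, prod)]
    return out

def external_adder(pe_outputs):
    """External adder assembly: accumulate the column's output operands
    channel by channel."""
    return [sum(ch) for ch in zip(*pe_outputs)]

pe0 = pe([([1, 2], [3, 4]), ([5, 6], [1, 1])])   # internal adds: [3+5, 8+6] = [8, 14]
pe1 = pe([([2, 2], [2, 2])])                     # -> [4, 4]
print(external_adder([pe0, pe1]))                # -> [12, 18]
```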
-
Publication No.: US20220083843A1
Publication Date: 2022-03-17
Application No.: US17534976
Application Date: 2021-11-24
Applicant: Intel Corporation
Inventor: Arnab Raha , Debabrata Mohapatra , Deepak Abraham Mathaikutty , Raymond Jit-Hung Sung , Cormac Michael Brick
Abstract: An apparatus is provided to access a weight vector of a layer in a sequence of layers in a deep neural network (DNN). The weight vector includes a first sequence of weights having different values. A bitmap is generated based on the weight vector. The bitmap includes a second sequence of bitmap elements. Each bitmap element corresponds to a different weight and has a value determined based at least on the value of the corresponding weight. The index of each bitmap element in the second sequence matches the index of the corresponding weight in the first sequence. A new bitmap is generated by rearranging the bitmap elements in the second sequence based on the values of the bitmap elements. The weight vector is rearranged based on the new bitmap. The rearranged weight vector is divided into subsets, each of which is assigned to a different processing element (PE) for a multiply-accumulate (MAC) operation.
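A sketch of the bitmap-driven rearrangement under stated assumptions: a bitmap element is 1 for a nonzero weight, and "rearranging" is modeled as a stable sort on the bitmap that packs nonzero weights together before dividing the vector into per-PE subsets. The patent does not specify this exact policy.

```python
def rearrange_for_pes(weights, num_pes):
    """Build a bitmap from the weight vector, reorder the weights by
    their bitmap values (nonzero first, original order preserved), and
    split the result into one contiguous subset per PE."""
    bitmap = [1 if w != 0 else 0 for w in weights]        # per-weight bit
    order = sorted(range(len(weights)),
                   key=lambda i: -bitmap[i])              # new bitmap order
    rearranged = [weights[i] for i in order]
    chunk = len(rearranged) // num_pes
    return [rearranged[i * chunk:(i + 1) * chunk] for i in range(num_pes)]

# Nonzero weights are packed first, then split into one subset per PE.
print(rearrange_for_pes([0, 5, 0, 3, 7, 0, 1, 2], num_pes=2))
# -> [[5, 3, 7, 1], [2, 0, 0, 0]]
```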