-
公开(公告)号:US11360809B2
公开(公告)日:2022-06-14
申请号:US16024343
申请日:2018-06-29
Applicant: Intel Corporation
Inventor: William Paul Griffin , Joshua Fryman , Jason Howard , Sang Phill Park , Robert Pawlowski , Michael Abbott , Scott Cline , Samkit Jain , Ankit More , Vincent Cave , Fabrizio Petrini , Ivan Ganev
Abstract: Embodiments of apparatuses, methods, and systems for scheduling tasks to hardware threads are described. In an embodiment, a processor includes a multiple hardware threads and a task manager. The task manager is to issue a task to a hardware thread. The task manager includes a hardware task queue to store a descriptor for the task. The descriptor is to include a field to store a value to indicate whether the task is a single task, a collection of iterative tasks, and a linked list of tasks.
-
公开(公告)号:US20200272596A1
公开(公告)日:2020-08-27
申请号:US16283795
申请日:2019-02-24
Applicant: INTEL CORPORATION
Inventor: Srinivasan Narayanamoorthy , Jayaram Bobba , Ankit More
IPC: G06F15/80
Abstract: The present disclosure is directed to systems and methods for decomposing systolic array circuitry to provide a plurality of N×N systolic sub-array circuits, apportioning a first tensor or array into a plurality of N×M first input arrays, and apportioning a second tensor or array into a plurality of M×N second input arrays. Systolic array control circuitry transfers corresponding ones of the first input arrays and second input arrays to a respective one of the plurality of N×N systolic sub-array circuits. As the elements included in the first input array and the elements included in the second input array are transferred to the systolic sub-array, the systolic sub-array performs one or more mathematical operations using the first and the second input arrays. The systems and methods beneficially improve the usage of the systolic array circuitry thereby advantageously reducing the number of clock cycles needed to perform a given number of calculations.
-
13.
公开(公告)号:US10476492B2
公开(公告)日:2019-11-12
申请号:US16201915
申请日:2018-11-27
Applicant: Intel Corporation
Inventor: Ankit More , Jason M. Howard , Robert Pawlowski , Fabrizio Petrini , Shaden Smith
IPC: H03K17/00 , G11C7/10 , H03K19/173
Abstract: Embodiments herein may present an integrated circuit including a switch, where the switch together with other switches forms a network of switches to perform a sequence of operations according to a structure of a collective tree. The switch includes a first number of input ports, a second number of output ports, a configurable crossbar to selectively couple the first number of input ports to the second number of output ports, and a computation engine coupled to the first number of input ports, the second number of output ports, and the crossbar. The computation engine of the switch performs an operation corresponding to an operation represented by a node of the collective tree. The switch further includes one or more registers to selectively configure the first number of input ports and the configurable crossbar. Other embodiments may be described and/or claimed.
-
公开(公告)号:US20220100508A1
公开(公告)日:2022-03-31
申请号:US17134251
申请日:2020-12-25
Applicant: Intel Corporation
Inventor: Robert Pawlowski , Ankit More , Vincent Cave , Sriram Aananthakrishnan , Jason M. Howard , Joshua B. Fryman
Abstract: Embodiments of apparatuses and methods for copying and operating on matrix elements are described. In embodiments, an apparatus includes a hardware instruction decoder to decode a single instruction and execution circuitry, coupled to hardware instruction decoder, to perform one or more operations corresponding to the single instruction. The single instruction has a first operand to reference a base address of a first representation of a source matrix and a second operand to reference a base address of second representation of a destination matrix. The one or more operations include copying elements of the source matrix to corresponding locations in the destination matrix and filling empty elements of the destination matrix with a single value.
-
公开(公告)号:US20200310795A1
公开(公告)日:2020-10-01
申请号:US16369846
申请日:2019-03-29
Applicant: INTEL CORPORATION
Inventor: Joshua Fryman , Ankit More , Jason Howard , Robert Pawlowski , Yigit Demir , Nick Pepperling , Fabrizio Petrini , Sriram Aananthakrishnan , Shaden Smith
Abstract: The present disclosure is directed to systems and methods of performing one or more broadcast or reduction operations using direct memory access (DMA) control circuitry. The DMA control circuitry executes a modified instruction set architecture (ISA) that facilitates the broadcast distribution of data to a plurality of destination addresses in system memory circuitry. The broadcast instruction may include broadcast of a single data value to each destination address. The broadcast instruction may include broadcast of a data array to each destination address. The DMA control circuitry may also execute a reduction instruction that facilitates the retrieval of data from a plurality of source addresses in system memory and performing one or more operations using the retrieved data. Since the DMA control circuitry, rather than the processor circuitry performs the broadcast and reduction operations, system speed and efficiency is beneficially enhanced.
-
16.
公开(公告)号:US20190109590A1
公开(公告)日:2019-04-11
申请号:US16201915
申请日:2018-11-27
Applicant: Intel Corporation
Inventor: Ankit More , Jason M. Howard , Robert Pawlowski , Fabrizio Petrini , Shaden Smith
IPC: H03K17/00 , H03K19/173 , G11C7/10
CPC classification number: H03K17/005 , G11C7/1006 , H03K17/007 , H03K19/1733
Abstract: Embodiments herein may present an integrated circuit including a switch, where the switch together with other switches forms a network of switches to perform a sequence of operations according to a structure of a collective tree. The switch includes a first number of input ports, a second number of output ports, a configurable crossbar to selectively couple the first number of input ports to the second number of output ports, and a computation engine coupled to the first number of input ports, the second number of output ports, and the crossbar. The computation engine of the switch performs an operation corresponding to an operation represented by a node of the collective tree. The switch further includes one or more registers to selectively configure the first number of input ports and the configurable crossbar. Other embodiments may be described and/or claimed.
-
17.
公开(公告)号:US20180285252A1
公开(公告)日:2018-10-04
申请号:US15477072
申请日:2017-04-01
Applicant: Intel Corporation
Inventor: Kon-Woo Kwon , Vivek Kozhikkottu , Sang Phill Park , Ankit More , William P. Griffin , Robert Pawlowski , Jason M. Howard , Joshua B. Fryman
IPC: G06F12/02 , G06F12/0802 , G06F12/0846 , G11C7/10 , G06F12/06
Abstract: Optimized memory access bandwidth devices, systems, and methods for processing low spatial locality data are disclosed and described. A system memory is divided into a plurality of memory subsections, where each memory subsection is communicatively coupled to an independent memory channel to a memory controller. Memory access requests from a processor are thereby sent by the memory controller to only the appropriate memory subsection.
-
公开(公告)号:US11106494B2
公开(公告)日:2021-08-31
申请号:US16147302
申请日:2018-09-28
Applicant: Intel Corporation
Inventor: Robert Pawlowski , Ankit More , Jason M. Howard , Joshua B. Fryman , Tina C. Zhong , Shaden Smith , Sowmya Pitchaimoorthy , Samkit Jain , Vincent Cave , Sriram Aananthakrishnan , Bharadwaj Krishnamurthy
Abstract: Disclosed embodiments relate to an improved memory system architecture for multi-threaded processors. In one example, a system includes a system comprising a multi-threaded processor core (MTPC), the MTPC comprising: P pipelines, each to concurrently process T threads; a crossbar to communicatively couple the P pipelines; a memory for use by the P pipelines, a scheduler to optimize reduction operations by assigning multiple threads to generate results of commutative arithmetic operations, and then accumulate the generated results, and a memory controller (MC) to connect with external storage and other MTPCs, the MC further comprising at least one optimization selected from: an instruction set architecture including a dual-memory operation; a direct memory access (DMA) engine; a buffer to store multiple pending instruction cache requests; multiple channels across which to stripe memory requests; and a shadow-tag coherency management unit.
-
公开(公告)号:US11061742B2
公开(公告)日:2021-07-13
申请号:US16019685
申请日:2018-06-27
Applicant: Intel Corporation
Inventor: Robert Pawlowski , Ankit More , Shaden Smith , Sowmya Pitchaimoorthy , Samkit Jain , Vincent Cavé , Sriram Aananthakrishnan , Jason M. Howard , Joshua B. Fryman
Abstract: In one embodiment, a first processor core includes: a plurality of execution pipelines each to execute instructions of one or more threads; a plurality of pipeline barrier circuits coupled to the plurality of execution pipelines, each of the plurality of pipeline barrier circuits associated with one of the plurality of execution pipelines to maintain status information for a plurality of barrier groups, each of the plurality of barrier groups formed of at least two threads; and a core barrier circuit to control operation of the plurality of pipeline barrier circuits and to inform the plurality of pipeline barrier circuits when a first barrier has been reached by a first barrier group of the plurality of barrier groups. Other embodiments are described and claimed.
-
公开(公告)号:US11003619B2
公开(公告)日:2021-05-11
申请号:US16283795
申请日:2019-02-24
Applicant: INTEL CORPORATION
Inventor: Srinivasan Narayanamoorthy , Jayaram Bobba , Ankit More
Abstract: The present disclosure is directed to systems and methods for decomposing systolic array circuitry to provide a plurality of N×N systolic sub-array circuits, apportioning a first tensor or array into a plurality of N×M first input arrays, and apportioning a second tensor or array into a plurality of M×N second input arrays. Systolic array control circuitry transfers corresponding ones of the first input arrays and second input arrays to a respective one of the plurality of N×N systolic sub-array circuits. As the elements included in the first input array and the elements included in the second input array are transferred to the systolic sub-array, the systolic sub-array performs one or more mathematical operations using the first and the second input arrays. The systems and methods beneficially improve the usage of the systolic array circuitry thereby advantageously reducing the number of clock cycles needed to perform a given number of calculations.
-
-
-
-
-
-
-
-
-