-
Publication Number: US20220350651A1
Publication Date: 2022-11-03
Application Number: US17746201
Application Date: 2022-05-17
Applicant: Intel Corporation
Inventor: Joydeep Ray , Abhishek R. Appu , Altug Koker , Kamal Sinha , Balaji Vembu , Rajkishore Barik , Eriko Nurvitadhi , Nicolas Galoppo Von Borries , Tsung-Han Lin , Sanjeev Jahagirdar , Vasanth Ranganathan
Abstract: A mechanism is described for facilitating intelligent thread scheduling at autonomous machines. A method of embodiments, as described herein, includes detecting dependency information relating to a plurality of threads corresponding to a plurality of workloads associated with tasks relating to a processor including a graphics processor. The method may further include generating a tree of thread groups based on the dependency information, where each thread group includes multiple threads, and scheduling one or more of the thread groups associated with a similar dependency to avoid dependency conflicts.
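The abstract describes grouping threads by shared dependencies and dispatching each group together. The sketch below illustrates only the grouping and co-scheduling step (not the full tree construction); the names Thread, build_dependency_groups, and schedule are hypothetical and not taken from the patent.

```python
from collections import defaultdict
from dataclasses import dataclass, field

@dataclass
class Thread:
    tid: int
    # Resources (e.g. buffers, surfaces) this thread depends on.
    dependencies: frozenset = field(default_factory=frozenset)

def build_dependency_groups(threads):
    """Group threads that share the same dependency set."""
    groups = defaultdict(list)
    for t in threads:
        groups[t.dependencies].append(t)
    return groups

def schedule(groups):
    """Dispatch each group together, so threads with a similar
    dependency are co-scheduled and dependency conflicts are avoided."""
    for deps, members in groups.items():
        print(f"dispatching group {sorted(t.tid for t in members)} "
              f"sharing dependencies {sorted(deps)}")

threads = [Thread(0, frozenset({"A"})), Thread(1, frozenset({"A"})),
           Thread(2, frozenset({"B"}))]
schedule(build_dependency_groups(threads))
```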
-
Publication Number: US11263720B2
Publication Date: 2022-03-01
Application Number: US16985329
Application Date: 2020-08-05
Applicant: Intel Corporation
Inventor: Saurabh Sharma , Abhishek Venkatesh , Travis T. Schluessler , Prasoonkumar Surti , Altug Koker , Aravindh V. Anantaraman , Pattabhiraman P. K. , Abhishek R. Appu , Joydeep Ray , Kamal Sinha , Vasanth Ranganathan , Bhushan M. Borole , Wenyin Fu , Eric J. Hoekstra , Linda L. Hurd
Abstract: A control surface tracks, for each individual cacheline in the original surface, whether the line holds a frequent data value; if it does, control surface bits are set. When reading a cacheline from memory, the control surface bits are read first. If they are set, the original memory read is skipped altogether and the bits from the control surface instead provide the value for the entire cacheline.
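A rough software model of the read path just described: check the control-surface bits first and, if they are set, synthesize the cacheline from them instead of performing the memory read. The class name, the single frequent value, and its encoding are assumptions made for illustration only.

```python
CACHELINE_BYTES = 64
FREQUENT_VALUE = 0x00   # assumed frequent data value (e.g. an all-zero line)

class ControlSurfaceMemory:
    def __init__(self):
        self.surface = {}        # address -> 64-byte cacheline
        self.control_bits = {}   # address -> True if line holds the frequent value

    def write_line(self, addr, line):
        assert len(line) == CACHELINE_BYTES
        is_frequent = all(b == FREQUENT_VALUE for b in line)
        self.control_bits[addr] = is_frequent
        if not is_frequent:
            self.surface[addr] = bytes(line)

    def read_line(self, addr):
        # Read the control surface bits first.
        if self.control_bits.get(addr):
            # Memory read skipped: reconstruct the line from the control bits.
            return bytes([FREQUENT_VALUE] * CACHELINE_BYTES)
        return self.surface[addr]

mem = ControlSurfaceMemory()
mem.write_line(0x1000, bytes(CACHELINE_BYTES))         # all zeros -> control bits set
mem.write_line(0x1040, bytes(range(CACHELINE_BYTES)))  # ordinary data
assert mem.read_line(0x1000) == bytes(CACHELINE_BYTES)
```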
-
Publication Number: US11222392B2
Publication Date: 2022-01-11
Application Number: US16531763
Application Date: 2019-08-05
Applicant: Intel Corporation
Inventor: Prasoonkumar Surti , Narayan Srinivasa , Feng Chen , Joydeep Ray , Ben J. Ashbaugh , Nicolas C. Galoppo Von Borries , Eriko Nurvitadhi , Balaji Vembu , Tsung-Han Lin , Kamal Sinha , Rajkishore Barik , Sara S. Baghsorkhi , Justin E. Gottschlich , Altug Koker , Nadathur Rajagopalan Satish , Farshad Akhbari , Dukhwan Kim , Wenyin Fu , Travis T. Schluessler , Josh B. Mastronarde , Linda L. Hurd , John H. Feit , Jeffery S. Boles , Adam T. Lake , Karthik Vaidyanathan , Devan Burke , Subramaniam Maiyuran , Abhishek R. Appu
IPC: G06T1/20 , G06T1/60 , G09G5/36 , G06F3/06 , G06N3/08 , G06F3/14 , G06N3/04 , G06N3/063 , G09G5/00
Abstract: An apparatus to facilitate compute optimization is disclosed. The apparatus includes a memory device including a first integrated circuit (IC) including a plurality of memory channels and a second IC including a plurality of processing units, each coupled to a memory channel in the plurality of memory channels.
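The abstract outlines a memory-side arrangement in which each processing unit is paired with its own memory channel. A toy sketch of that pairing, with entirely hypothetical names and a trivial per-channel reduction as the workload:

```python
from dataclasses import dataclass

@dataclass
class MemoryChannel:
    channel_id: int
    data: list

class ChannelProcessingUnit:
    """One processing unit coupled to exactly one memory channel."""
    def __init__(self, channel):
        self.channel = channel

    def reduce_sum(self):
        # Operate only on data reachable through the attached channel.
        return sum(self.channel.data)

channels = [MemoryChannel(i, list(range(i, i + 4))) for i in range(4)]
units = [ChannelProcessingUnit(ch) for ch in channels]
partial_sums = [u.reduce_sum() for u in units]
print(partial_sums)
```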
-
Publication Number: US20210373886A1
Publication Date: 2021-12-02
Application Number: US17443376
Application Date: 2021-07-26
Applicant: Intel Corporation
Inventor: Kevin Nealis , Anbang Yao , Xiaoming Chen , Elmoustapha Ould-Ahmed-Vall , Sara S. Baghsorkhi , Eriko Nurvitadhi , Balaji Vembu , Nicolas C. Galoppo Von Borries , Rajkishore Barik , Tsung-Han Lin , Kamal Sinha
Abstract: One embodiment provides for a compute apparatus comprising a decode unit to decode a single instruction into a decoded instruction that specifies multiple operands including a multi-bit input value and a ternary weight associated with a neural network and an arithmetic logic unit including a multiplier, an adder, and an accumulator register. To execute the decoded instruction, the multiplier is to perform a multiplication operation on the multi-bit input based on the ternary weight to generate an intermediate product and the adder is to add the intermediate product to a value stored in the accumulator register and update the value stored in the accumulator register.
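Because a ternary weight takes only the values -1, 0, or +1, the multiply in such an instruction degenerates to a negate, a skip, or a pass-through before the accumulate. A software-only sketch of that semantics follows; the function name and register modeling are invented for illustration and do not reflect the actual instruction encoding.

```python
def ternary_mac(acc, x, w):
    """Model one ternary multiply-accumulate step.

    x   -- multi-bit input value (an integer)
    w   -- ternary weight, restricted to {-1, 0, +1}
    acc -- accumulator register value; returns the updated accumulator
    """
    if w not in (-1, 0, 1):
        raise ValueError("ternary weight must be -1, 0, or +1")
    # The "multiplication" reduces to negate / zero / pass-through.
    product = -x if w == -1 else (x if w == 1 else 0)
    return acc + product

acc = 0
inputs = [5, -3, 7, 2]
weights = [1, -1, 0, 1]
for x, w in zip(inputs, weights):
    acc = ternary_mac(acc, x, w)
print(acc)   # 5 + 3 + 0 + 2 = 10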
-
Publication Number: US20210349715A1
Publication Date: 2021-11-11
Application Number: US17319056
Application Date: 2021-05-12
Applicant: Intel Corporation
Inventor: Abhishek R. Appu , Altug Koker , Joydeep Ray , Kamal Sinha , Kiran C. Veernapu , Subramaniam Maiyuran , Prasoonkumar Surti , Guei-Yuan Lueh , David Puffer , Supratim Pal , Eric J. Hoekstra , Travis T. Schluessler , Linda L. Hurd
Abstract: In an example, an apparatus comprises a plurality of execution units, and a first general register file (GRF) communicatively coupled to the plurality of execution units, wherein the first GRF is shared by the plurality of execution units. Other embodiments are also disclosed and claimed.
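A very small sketch of what a register file shared by several execution units could mean operationally. The class and addressing scheme are assumptions; real hardware arbitration, banking, and partition remapping are omitted.

```python
class SharedGRF:
    """A register file addressed by (execution-unit id, register index)."""
    def __init__(self, num_eus, regs_per_eu):
        self.regs = [[0] * regs_per_eu for _ in range(num_eus)]

    def read(self, eu, idx):
        return self.regs[eu][idx]

    def write(self, eu, idx, value):
        self.regs[eu][idx] = value

grf = SharedGRF(num_eus=4, regs_per_eu=128)
grf.write(eu=0, idx=3, value=42)
# Because the storage is one shared pool, capacity can in principle be
# redistributed among execution units rather than fixed per unit.
print(grf.read(eu=0, idx=3))
```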
-
Publication Number: US11145105B2
Publication Date: 2021-10-12
Application Number: US16355364
Application Date: 2019-03-15
Applicant: Intel Corporation
Inventor: Prasoonkumar Surti , Arthur Hunter, Jr. , Kamal Sinha , Scott Janus , Brent Insko , Vasanth Ranganathan , Lakshminarayanan Striramassarma
Abstract: Embodiments are generally directed to multi-tile graphics processor rendering. An embodiment of an apparatus includes a memory for storage of data; and one or more processors including a graphics processing unit (GPU) to process data, wherein the GPU includes a plurality of GPU tiles, wherein, upon geometric data being assigned to each of a plurality of screen tiles, the apparatus is to transfer the geometric data to the plurality of GPU tiles.
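The abstract describes assigning geometry to screen tiles and then transferring that geometry to GPU tiles. A small binning sketch under assumed names follows; binning by triangle centroid and mapping screen tiles to GPU tiles round-robin are simplifying assumptions, not the patented method.

```python
from collections import defaultdict

TILE_SIZE = 64      # assumed screen-tile size in pixels
NUM_GPU_TILES = 4   # assumed number of GPU tiles

def screen_tile(x, y):
    return (int(x) // TILE_SIZE, int(y) // TILE_SIZE)

def bin_geometry(triangles):
    """Assign each triangle to the screen tile containing its centroid."""
    bins = defaultdict(list)
    for tri in triangles:
        cx = sum(v[0] for v in tri) / 3
        cy = sum(v[1] for v in tri) / 3
        bins[screen_tile(cx, cy)].append(tri)
    return bins

def distribute_to_gpu_tiles(bins):
    """Transfer each screen tile's geometry to a GPU tile (round-robin here)."""
    per_gpu = defaultdict(list)
    for i, (tile, tris) in enumerate(sorted(bins.items())):
        per_gpu[i % NUM_GPU_TILES].extend(tris)
    return per_gpu

tris = [((0, 0), (10, 0), (0, 10)), ((100, 100), (120, 100), (100, 130))]
print(distribute_to_gpu_tiles(bin_geometry(tris)))
```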
-
Publication Number: US20210294649A1
Publication Date: 2021-09-23
Application Number: US17206514
Application Date: 2021-03-19
Applicant: Intel Corporation
Inventor: Chandrasekaran Sakthivel , Prasoonkumar Surti , John C. Weast , Sara S. Baghsorkhi , Justin E. Gottschlich , Abhishek R. Appu , Nicolas C. Galoppo Von Borries , Joydeep Ray , Narayan Srinivasa , Feng Chen , Ben J. Ashbaugh , Rajkishore Barik , Tsung-Han Lin , Kamal Sinha , Eriko Nurvitadhi , Balaji Vembu , Altug Koker
Abstract: In an example, an apparatus comprises a plurality of processing unit cores, a plurality of cache memory modules associated with the plurality of processing unit cores, and a machine learning model communicatively coupled to the plurality of processing unit cores, wherein the plurality of cache memory modules share cache coherency data with the machine learning model. Other embodiments are also disclosed and claimed.
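The abstract only states that the cache memory modules share coherency data with a machine learning model; it does not say what the model does with it. Purely as an illustration, the sketch below streams coherency-state transitions to a placeholder model as feature tuples; every name here is hypothetical.

```python
from collections import deque

class PlaceholderModel:
    """Stands in for the machine learning model; it just stores recent events."""
    def __init__(self, window=8):
        self.events = deque(maxlen=window)

    def observe(self, event):
        self.events.append(event)

class CacheModule:
    def __init__(self, core_id, model):
        self.core_id = core_id
        self.model = model
        self.lines = {}   # address -> MESI-style state

    def set_state(self, addr, state):
        self.lines[addr] = state
        # Share the coherency transition with the model.
        self.model.observe((self.core_id, addr, state))

model = PlaceholderModel()
caches = [CacheModule(core, model) for core in range(2)]
caches[0].set_state(0x40, "M")
caches[1].set_state(0x40, "S")
print(list(model.events))
```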
-
Publication Number: US20210294373A1
Publication Date: 2021-09-23
Application Number: US16823221
Application Date: 2020-03-18
Applicant: Intel Corporation
Inventor: Navid Toosizadeh , Kamal Sinha , Altug Koker
Abstract: An all-digital, closed-loop, fine-grained control scheme manages the voltage and frequency of a compute machine such as a graphics processing unit (GPU), central processing unit (CPU), or any other processing unit under running conditions. The scheme optimizes the voltage margin and frequency on the fly according to desired programmable performance metrics. A mitigation response to droops is naturally built into the system and is proportionate to the cause rather than excessive. The scheme is scalable and can be instantiated in different clusters for best results.
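A toy closed-loop controller in the spirit of this abstract: it nudges the frequency toward a programmable performance target and responds to a measured droop in proportion to its size rather than with a fixed worst-case cut. The gains, units, and droop model are all assumptions.

```python
def control_step(freq_mhz, measured_perf, target_perf, droop_mv,
                 kp=0.5, droop_gain=2.0):
    """One iteration of a simple digital V/F control loop.

    freq_mhz      -- current frequency setting
    measured_perf -- observed performance metric (arbitrary units)
    target_perf   -- programmable performance target
    droop_mv      -- measured supply droop in millivolts (0 if none)
    Returns the next frequency setting.
    """
    # Close the loop on the performance error.
    error = target_perf - measured_perf
    freq_mhz += kp * error
    # Mitigation proportional to the droop, not an excessive fixed penalty.
    freq_mhz -= droop_gain * droop_mv
    return max(freq_mhz, 0.0)

freq = 1000.0
for perf, droop in [(90.0, 0.0), (95.0, 5.0), (100.0, 0.0)]:
    freq = control_step(freq, perf, target_perf=100.0, droop_mv=droop)
    print(round(freq, 1))
```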
-
Publication Number: US11080046B2
Publication Date: 2021-08-03
Application Number: US17169232
Application Date: 2021-02-05
Applicant: Intel Corporation
Inventor: Himanshu Kaul , Mark A. Anders , Sanu K. Mathew , Anbang Yao , Joydeep Ray , Ping T. Tang , Michael S. Strickland , Xiaoming Chen , Tatiana Shpeisman , Abhishek R. Appu , Altug Koker , Kamal Sinha , Balaji Vembu , Nicolas C. Galoppo Von Borries , Eriko Nurvitadhi , Rajkishore Barik , Tsung-Han Lin , Vasanth Ranganathan , Sanjeev Jahagirdar
IPC: G06F9/302 , G06F7/483 , G06N3/04 , G06F17/16 , G06F9/30 , G09G5/393 , G06F7/544 , G06F9/38 , G06N3/08 , G06N3/063 , G06N20/00 , G06T15/00
Abstract: A processing apparatus is provided comprising a multiprocessor having a multithreaded architecture. The multiprocessor can execute at least one single instruction to perform parallel mixed precision matrix operations. In one embodiment the apparatus includes a memory interface and an array of multiprocessors coupled to the memory interface. At least one multiprocessor in the array of multiprocessors is configured to execute a fused multiply-add instruction in parallel across multiple threads.
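A minimal illustration of the mixed-precision fused multiply-add the abstract mentions: half-precision inputs multiplied and accumulated into a single-precision result, applied element-wise here as a stand-in for the per-thread parallelism of the hardware. numpy is assumed to be available; this is a numerical model, not the instruction itself.

```python
import numpy as np

def mixed_precision_fma(a_fp16, b_fp16, c_fp32):
    """Compute a * b + c with FP16 inputs and an FP32 accumulator."""
    # Promote the FP16 operands before the multiply so the product
    # and the accumulation keep single-precision accuracy.
    return a_fp16.astype(np.float32) * b_fp16.astype(np.float32) + c_fp32

a = np.array([1.5, 2.25], dtype=np.float16)
b = np.array([4.0, 0.5], dtype=np.float16)
c = np.zeros(2, dtype=np.float32)
print(mixed_precision_fma(a, b, c))   # [6.0, 1.125], stored as float32
```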
-
Publication Number: US10943325B2
Publication Date: 2021-03-09
Application Number: US16930841
Application Date: 2020-07-16
Applicant: Intel Corporation
Inventor: Eriko Nurvitadhi , Balaji Vembu , Tsung-Han Lin , Kamal Sinha , Rajkishore Barik , Nicolas C. Galoppo Von Borries
IPC: G06F17/16 , G06T1/20 , G06F9/30 , G06T1/60 , G06K9/62 , G06F12/0888 , G06F12/0815 , H03M7/30 , G06F9/48 , G06T15/00 , G06N3/04 , G06F9/38 , G06F12/0831 , G06F12/0811 , G06N3/08 , G06N20/00
Abstract: Techniques to improve performance of matrix multiply operations are described in which a compute kernel can specify one or more element-wise operations to perform on output of the compute kernel before the output is transferred to higher levels of a processor memory hierarchy.
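The idea here is that the compute kernel applies element-wise operations to its output before that output leaves the kernel for higher levels of the memory hierarchy. Below is a hedged sketch of a matrix multiply that fuses caller-specified element-wise functions into the result before it is returned; the function signature and the ReLU example are assumptions.

```python
import numpy as np

def matmul_fused(a, b, elementwise_ops=()):
    """Multiply a @ b and apply element-wise ops before handing the result back.

    elementwise_ops -- functions applied while the result is still local to
    the kernel, avoiding extra round trips through the memory hierarchy.
    """
    out = a @ b
    for op in elementwise_ops:
        out = op(out)
    return out

relu = lambda x: np.maximum(x, 0.0)
a = np.array([[1.0, -2.0], [3.0, 4.0]])
b = np.array([[1.0, 0.0], [0.0, 1.0]])
print(matmul_fused(a, b, elementwise_ops=(relu,)))
```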
-