-
Publication No.: US12182035B2
Publication Date: 2024-12-31
Application No.: US17428529
Filing Date: 2020-03-14
Applicant: Intel Corporation
Inventor: Altug Koker , Joydeep Ray , Elmoustapha Ould-Ahmed-Vall , Abhishek Appu , Aravindh Anantaraman , Valentin Andrei , Durgaprasad Bilagi , Varghese George , Brent Insko , Sanjeev Jahagirdar , Scott Janus , Pattabhiraman K , SungYe Kim , Subramaniam Maiyuran , Vasanth Ranganathan , Lakshminarayanan Striramassarma , Xinmin Tian
IPC: G06F12/00 , G06F12/0875 , G06F12/0891 , G06F12/123 , G06T1/60
Abstract: Systems and methods for improving cache efficiency and utilization are disclosed. In one embodiment, a graphics processor includes processing resources to perform graphics operations and a cache controller of a cache memory that is coupled to the processing resources. The cache controller is configured to set an initial aging policy using an aging field based on age of cache lines within the cache memory and to determine whether a hint or an instruction to indicate a level of aging has been received.
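The aging idea in this abstract can be pictured with a toy model. The sketch below is only an illustration under assumptions that are not in the patent text: a single cache set, a small saturating age counter per line, a default policy that inserts lines as "young", and an optional `age_hint` argument standing in for the hint or instruction. The names `CacheSet`, `access`, and `age_hint` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class CacheSet:
    """Toy model of one cache set with a per-line age field (assumed design)."""
    ways: int
    max_age: int = 3
    lines: dict = field(default_factory=dict)  # tag -> age

    def access(self, tag, age_hint=None):
        """Touch `tag`; return the evicted tag, or None if nothing was evicted."""
        evicted = None
        if tag not in self.lines and len(self.lines) >= self.ways:
            # Evict the line with the greatest age.
            evicted = max(self.lines, key=self.lines.get)
            del self.lines[evicted]
        # Every other resident line grows older, saturating at max_age.
        for other in self.lines:
            if other != tag:
                self.lines[other] = min(self.lines[other] + 1, self.max_age)
        # Initial policy: the touched line is young (age 0).
        # A hint, if supplied, overrides that default level of aging.
        self.lines[tag] = 0 if age_hint is None else min(age_hint, self.max_age)
        return evicted


s = CacheSet(ways=2)
s.access("A")                   # default aging
s.access("B", age_hint=3)       # hinted as already old, e.g. streaming data
victim = s.access("C")          # the hinted line is the first to go
```

In this model a line hinted as old is evicted before an older but unhinted line, which is one plausible way a hint could improve utilization.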
-
Publication No.: US20240111826A1
Publication Date: 2024-04-04
Application No.: US17937252
Filing Date: 2022-09-30
Applicant: Intel Corporation
Inventor: Jiasheng Chen , Kevin Hurd , Changwon Rhee , Jorge Parra , Fangwen Fu , Theo Drane , William Zorn , Peter Caday , Gregory Henry , Guei-Yuan Lueh , Farzad Chehrazi , Amit Karande , Turbo Majumder , Xinmin Tian , Milind Girkar , Hong Jiang
CPC classification number: G06F17/16 , G06F7/5443 , G06T1/20
Abstract: An apparatus to facilitate hardware enhancements for double precision systolic support is disclosed. The apparatus includes matrix acceleration hardware having double-precision (DP) matrix multiplication circuitry including multiplier circuits to multiply pairs of input source operands in a DP floating-point format; adders to receive multiplier outputs from the multiplier circuits and accumulate the multiplier outputs in a high precision intermediate format; an accumulator circuit to accumulate adder outputs from the adders with at least one of a third global source operand on a first pass of the DP matrix multiplication circuitry or an intermediate result from the first pass on a second pass of the DP matrix multiplication circuitry, wherein the accumulator circuit is to generate an accumulator output in the high precision intermediate format; and a down conversion and rounding circuit to down convert and round an output of the second pass as a final result in the DP floating-point format.
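The benefit of accumulating in a wider intermediate format and rounding only once can be shown numerically. The sketch below is an illustration, not the patented circuit: it assumes exact rational arithmetic (`fractions.Fraction`) as a stand-in for the high precision intermediate format, and the function names `dot_pass` and `dp_dot_two_pass` and the `split` parameter are hypothetical.

```python
from fractions import Fraction


def dot_pass(a, b, acc):
    """One pass: multiply operand pairs exactly, sum them, add to the accumulator."""
    products = [Fraction(x) * Fraction(y) for x, y in zip(a, b)]
    return sum(products, acc)


def dp_dot_two_pass(a, b, c, split):
    """Two-pass dot product with a single final rounding to double precision."""
    # First pass: accumulate with the third source operand c.
    intermediate = dot_pass(a[:split], b[:split], Fraction(c))
    # Second pass: accumulate with the intermediate result of the first pass.
    final = dot_pass(a[split:], b[split:], intermediate)
    # Down-convert and round once, at the very end.
    return float(final)


def dp_dot_naive(a, b, c):
    """Reference: round to double after every multiply-add."""
    acc = c
    for x, y in zip(a, b):
        acc = acc + x * y
    return acc


a = [1e16, 1.0, -1e16, 1.0]
b = [1.0, 1.0, 1.0, 1.0]
wide = dp_dot_two_pass(a, b, 0.0, split=2)   # 2.0
naive = dp_dot_naive(a, b, 0.0)              # 1.0, one addend lost to rounding
```

Rounding after every step loses the `1.0` added to `1e16`, while deferring the rounding to the end of the second pass keeps it.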
-
Publication No.: US11861761B2
Publication Date: 2024-01-02
Application No.: US17095590
Filing Date: 2020-11-11
Applicant: Intel Corporation
Inventor: Subramaniam Maiyuran , Durgaprasad Bilagi , Joydeep Ray , Scott Janus , Sanjeev Jahagirdar , Brent Insko , Lidong Xu , Abhishek R. Appu , James Holland , Vasanth Ranganathan , Nikos Kaburlasos , Altug Koker , Xinmin Tian , Guei-Yuan Lueh , Changliang Wang
IPC: G06T1/60 , G06T1/20 , G06F12/0802 , G06N5/04
CPC classification number: G06T1/60 , G06F12/0802 , G06N5/04 , G06T1/20 , G06F2212/251
Abstract: Embodiments described herein are generally directed to improvements relating to power, latency, bandwidth and/or performance issues relating to GPU processing/caching. According to one embodiment, a system includes a producer intellectual property (IP) (e.g., a media IP), a compute core (e.g., a GPU or an AI-specific core of the GPU), and a streaming buffer logically interposed between the producer IP and the compute core. The producer IP is operable to consume data from memory and output results to the streaming buffer. The compute core is operable to perform AI inference processing based on data consumed from the streaming buffer and output AI inference processing results to the memory.
-
Publication No.: US20230260075A1
Publication Date: 2023-08-17
Application No.: US18305904
Filing Date: 2023-04-24
Applicant: Intel Corporation
Inventor: Subramaniam Maiyuran , Durgaprasad Bilagi , Joydeep Ray , Scott Janus , Sanjeev Jahagirdar , Brent Insko , Lidong Xu , Abhishek R. Appu , James Holland , Vasanth Ranganathan , Nikos Kaburlasos , Altug Koker , Xinmin Tian , Guei-Yuan Lueh , Changliang Wang
IPC: G06T1/60 , G06T1/20 , G06F12/0802 , G06N5/04
CPC classification number: G06T1/60 , G06T1/20 , G06F12/0802 , G06N5/04 , G06F2212/251
Abstract: Embodiments described herein are generally directed to improvements relating to power, latency, bandwidth and/or performance issues relating to GPU processing/caching. According to one embodiment, a state of multiple intellectual property (IP) cores that have access to a common cache via a central fabric is observed. Responsive to the observed state being indicative of performance of a standalone workload by a first IP core of the multiple IP cores, the common cache is treated as a local cache of the first IP core by powering off the central fabric and causing the first IP core to access the common cache via a low power access path between the first IP core and the common cache that is outside of the central fabric.
-
Publication No.: US20230028666A1
Publication Date: 2023-01-26
Application No.: US17379121
Filing Date: 2021-07-19
Applicant: Intel Corporation
Inventor: Joydeep Ray , Prathamesh Raghunath Shinde , Yue Qi , Abhishek R. Appu , Xinmin Tian , Vasanth Ranganathan , Ben J. Ashbaugh
IPC: G06F9/30
Abstract: Embodiments are directed to systems and methods for performing global memory atomics in a private cache of a sub-core of a GPU. An embodiment of a GPU includes multiple sub-cores each including a load/store pipeline. The load/store pipeline is operable to receive information specifying an atomic operation to be performed within a primary data cache of the load/store pipeline. The load/store pipeline is also operable to read data to be modified by the atomic operation into the primary data cache from a memory hierarchy shared by the multiple sub-cores. The load/store pipeline is further operable to produce an atomic result of the atomic operation by modifying the data within the primary data cache based on the atomic operation.
-
Publication No.: US10452403B2
Publication Date: 2019-10-22
Application No.: US14866875
Filing Date: 2015-09-26
Applicant: Intel Corporation
Inventor: Hong Wang , John P. Shen , Edward T. Grochowski , Richard A. Hankins , Gautham N. Chinya , Bryant E. Bigbee , Shivnandan D. Kaushik , Xiang Chris Zou , Per Hammarlund , Scott Dion Rodgers , Xinmin Tian , Anil Aggawal , Prashant Sethi , Baiju V. Patel , James P Held
Abstract: In an embodiment, a method is provided. The method includes managing user-level threads on a first instruction sequencer in response to executing user-level instructions on a second instruction sequencer that is under control of an application level program. A first user-level thread is run on the second instruction sequencer and contains one or more user level instructions. A first user level instruction has at least 1) a field that makes reference to one or more instruction sequencers or 2) implicitly references with a pointer to code that specifically addresses one or more instruction sequencers when the code is executed.
-
Publication No.: US20190005175A1
Publication Date: 2019-01-03
Application No.: US15636265
Filing Date: 2017-06-28
Applicant: Intel Corporation
Inventor: Xinmin Tian , Geoff Lowney
IPC: G06F17/50
Abstract: Methods, apparatus, systems and articles of manufacture are disclosed to improve FPGA pipeline emulation efficiency on CPUs. An example disclosed apparatus includes a loop detector to identify a register shift loop in field programmable gate array (FPGA) code, an unroller to shift and store pipeline stages in the register shift loop to a temporary unroll array, an intermediate canceller to cancel out intermediate load and store values of the temporary unroll array to retain last shifted values of the pipeline stages, and a propagator to improve emulation efficiency of the FPGA code by generating a scalar loop of the retained last shifted values for a vectorization input.
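A small example shows why the intermediate values in a register shift loop can cancel. This is a sketch under assumptions, not the compiler transformation the application describes: it models a shift register of fixed `depth` that takes one new input per cycle, and compares a literal emulation against a collapsed form that keeps only the last shifted values. The function names are hypothetical.

```python
def emulate_naive(depth, inputs):
    """Literal emulation: shift every stage on every cycle, O(depth) work per input."""
    regs = [0] * depth
    for x in inputs:
        for i in range(depth - 1, 0, -1):
            regs[i] = regs[i - 1]   # stage i takes the value of stage i-1
        regs[0] = x                 # new value enters stage 0
    return regs


def emulate_collapsed(depth, inputs):
    """Collapsed form: intermediate shifts cancel, only the last `depth` inputs survive."""
    padded = [0] * depth + list(inputs)   # initial register contents, then inputs
    return padded[::-1][:depth]           # newest value first, as in the register


final_state = emulate_collapsed(3, [1, 2, 3, 4, 5])   # [5, 4, 3]
```

Both functions return the same register contents, but the collapsed form has no loop-carried shifting, which leaves a simpler loop for a vectorizer to work with.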
-
Publication No.: US09910796B2
Publication Date: 2018-03-06
Application No.: US13844343
Filing Date: 2013-03-15
Applicant: Intel Corporation
Inventor: Hong Wang , Per Hammarlund , Xiang Zou , John P. Shen , Xinmin Tian , Milind Girkar , Perry H. Wang , Piyush N. Desai
CPC classification number: G06F13/24 , G06F9/3005 , G06F9/3009 , G06F9/30145 , G06F9/3851 , G06F9/4843 , G06F11/3024 , G06F11/348 , G06F12/0875 , G06F2201/86 , G06F2201/88 , G06F2201/885 , G06F2212/452
Abstract: Method, apparatus, and program means for a programmable event driven yield mechanism that may activate other threads. In one embodiment, an apparatus includes execution resources to execute a plurality of instructions and a monitor to detect a condition indicating a low level of progress. The monitor can disrupt processing of a program by transferring to a handler in response to detecting the condition indicating a low level of progress. In another embodiment, thread switch logic may be coupled to a plurality of event monitors which monitor events within the multithreading execution logic. The thread switch logic switches threads based at least partially on a programmable condition of one or more of the performance monitors.