-
Publication No.: US11307866B2
Publication Date: 2022-04-19
Application No.: US16698998
Filing Date: 2019-11-28
Inventors: Shaoli Liu, Shengyuan Zhou, Zidong Du
IPC Classification: G06F9/302, G06F17/16, G06F9/38, G06F9/30, G06F3/01, G06F9/48, G06F9/50, G06F9/54, G06F11/07, G06F11/10, G06F11/30, G06F12/0875, G06K9/62, G06N3/04, G06N3/063, G06V40/16, G06F7/57, G06F7/544, G06F1/324
Abstract: The disclosure provides a data processing device and method. The data processing device may include a task configuration information storage unit and a task queue configuration unit. The task configuration information storage unit is configured to store configuration information of tasks. The task queue configuration unit is configured to configure a task queue according to the configuration information stored in the task configuration information storage unit. According to the disclosure, a task queue may be configured according to the stored configuration information.
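A minimal Python sketch of the idea in this abstract, assuming a dictionary-backed configuration store and a priority field; the names TaskConfigStore, configure_queue, and "priority" are illustrative and not taken from the patent.

    from collections import deque

    # Illustrative task-configuration storage unit: maps task IDs to configuration records.
    class TaskConfigStore:
        def __init__(self):
            self._configs = {}

        def store(self, task_id, config):
            self._configs[task_id] = config

        def all_configs(self):
            return self._configs.items()

    # Illustrative task-queue configuration unit: builds a queue ordered by a
    # priority field read from the stored configuration information.
    def configure_queue(store):
        ordered = sorted(store.all_configs(), key=lambda kv: kv[1].get("priority", 0))
        return deque(task_id for task_id, _ in ordered)

    if __name__ == "__main__":
        store = TaskConfigStore()
        store.store("conv1", {"priority": 1})
        store.store("pool1", {"priority": 2})
        print(configure_queue(store))  # deque(['conv1', 'pool1'])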
-
Publication No.: US11221879B2
Publication Date: 2022-01-11
Application No.: US16919968
Filing Date: 2020-07-02
Applicant: Google LLC
Abstract: Methods, systems, and apparatus for scheduling first-in-first-out instructions are described. In one aspect, a method includes receiving data representing code of a program to be executed by a processing unit comprising hardware processors. For each of one or more of the hardware processors, an order of independent groups of first-in-first-out (FIFO) instructions for execution by the hardware processor is identified in the data representing the code of the program. For each independent group of FIFO instructions for execution by the hardware processor, a path length metric that represents how long it will take to reach an end of the program from the independent group of FIFO instructions is determined. A new order of the independent groups of FIFO instructions for execution by the hardware processor is generated based at least on the path length metric for each independent group of FIFO instructions for execution by the hardware processor.
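A hedged sketch of the reordering step described here, assuming the path length metric is the longest chain of dependent groups between a group and the end of the program; the function names and the successors map are illustrative stand-ins.

    # Score each independent FIFO-instruction group by a path-length metric and
    # emit a new order that schedules long-path groups first.
    def path_length(group, successors, memo=None):
        memo = {} if memo is None else memo
        if group in memo:
            return memo[group]
        nexts = successors.get(group, [])
        memo[group] = 0 if not nexts else 1 + max(path_length(n, successors, memo) for n in nexts)
        return memo[group]

    def reorder_fifo_groups(groups, successors):
        return sorted(groups, key=lambda g: path_length(g, successors), reverse=True)

    if __name__ == "__main__":
        # groupA feeds groupB, which feeds groupC.
        successors = {"groupA": ["groupB"], "groupB": ["groupC"]}
        print(reorder_fifo_groups(["groupC", "groupA", "groupB"], successors))
        # ['groupA', 'groupB', 'groupC'] -- the longest remaining path is scheduled first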
-
Publication No.: US11016764B2
Publication Date: 2021-05-25
Application No.: US16843015
Filing Date: 2020-04-08
Applicant: Google LLC
Inventors: William Lacy, Gregory Michael Thorson, Christopher Aaron Clark, Norman Paul Jouppi, Thomas Norrie, Andrew Everett Phelps
IPC Classification: G06F9/302, G06F9/312, G06F15/80, G06F13/40, G06F7/57, G06N3/063, G06N20/00, G06F17/16, G06F9/30, G06F9/38, G06F7/58, G06F13/36, G06F13/42
Abstract: A vector processing unit is described, and includes processor units that each include multiple processing resources. The processor units are each configured to perform arithmetic operations associated with vectorized computations. The vector processing unit includes a vector memory in data communication with each of the processor units and their respective processing resources. The vector memory includes memory banks configured to store data used by each of the processor units to perform the arithmetic operations. The processor units and the vector memory are tightly coupled within an area of the vector processing unit such that data communications are exchanged at a high bandwidth, based on the placement of respective processor units relative to one another and on the placement of the vector memory relative to each processor unit.
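A rough illustration of how a banked vector memory can serve parallel lanes, assuming simple modulo interleaving of elements across banks; the bank count and function names are assumptions, not details from the patent.

    # Consecutive vector elements are spread across banks so that each processing
    # lane can read its operand from a different bank in the same cycle.
    NUM_BANKS = 8

    def bank_and_offset(element_index):
        return element_index % NUM_BANKS, element_index // NUM_BANKS

    def scatter_to_banks(vector):
        banks = [[] for _ in range(NUM_BANKS)]
        for i, value in enumerate(vector):
            bank, _ = bank_and_offset(i)
            banks[bank].append(value)
        return banks

    if __name__ == "__main__":
        print(scatter_to_banks(list(range(16))))  # bank i holds elements i and i+8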
-
Publication No.: US10929779B1
Publication Date: 2021-02-23
Application No.: US16420092
Filing Date: 2019-05-22
Inventors: Avinash Sodani, Gopal Nalamalapu
IPC Classification: G06F9/302, G06F9/52, G06F15/80, G06N5/04, G06N20/00, G06F9/38, G06F12/0804, G06F15/78, G06F9/50, G06F9/30
Abstract: A system to support a machine learning (ML) operation comprises a core configured to receive and interpret commands into a set of instructions for the ML operation and a memory unit configured to maintain data for the ML operation. The system further comprises an inference engine having a plurality of processing tiles, each comprising an on-chip memory (OCM) configured to maintain data for local access by components in the processing tile and one or more processing units configured to perform tasks of the ML operation on the data in the OCM. The system also comprises an instruction streaming engine configured to distribute the instructions to the processing tiles to control their operations and to synchronize data communication between the core and the inference engine, so that data transmitted between them correctly reaches the corresponding processing tiles while ensuring coherence of data shared and distributed among the core and the OCMs.
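A simplified sketch of the instruction-streaming structure described above, assuming each instruction names the tile whose OCM holds its operands; the ProcessingTile class and the instruction fields are hypothetical.

    # A streaming engine routes each instruction to the processing tile that owns
    # the relevant slice of data in its on-chip memory (OCM).
    class ProcessingTile:
        def __init__(self, tile_id):
            self.tile_id = tile_id
            self.ocm = {}          # on-chip memory local to this tile
            self.instructions = []

    def stream_instructions(instructions, tiles):
        for instr in instructions:
            tile = tiles[instr["tile"]]
            tile.instructions.append(instr["op"])

    if __name__ == "__main__":
        tiles = [ProcessingTile(i) for i in range(4)]
        stream_instructions([{"tile": 0, "op": "matmul"}, {"tile": 2, "op": "relu"}], tiles)
        print([t.instructions for t in tiles])  # [['matmul'], [], ['relu'], []]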
-
Publication No.: US10747531B1
Publication Date: 2020-08-18
Application No.: US15944315
Filing Date: 2018-04-03
Applicant: Xilinx, Inc.
Inventors: Jan Langer, Baris Ozgul, Juan J. Noguera Serra, Goran HK Bilski, Tim Tuan
Abstract: An example core for a data processing engine (DPE) includes a register file and a processor coupled to the register file. The processor includes a multiply-accumulate (MAC) circuit and permute circuitry coupled between the register file and the MAC circuit; the permute circuitry is configured to concatenate at least one pair of outputs of the register file to provide at least one input to the MAC circuit. The core further includes an instruction decoder, coupled to the processor, configured to decode a very large instruction word (VLIW) to set a plurality of parameters of the processor, the plurality of parameters including first parameters of the permute circuitry and second parameters of the MAC circuit.
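A small numeric illustration of the datapath idea, assuming 16-bit register outputs concatenated into a wider MAC operand; the widths and function names are assumptions, not the patent's encoding.

    # "Permute" stage: concatenate a pair of register-file outputs into one wider operand.
    def concatenate_pair(high_word, low_word, width=16):
        return (high_word << width) | (low_word & ((1 << width) - 1))

    # Multiply-accumulate: acc += a * b
    def mac(accumulator, a, b):
        return accumulator + a * b

    if __name__ == "__main__":
        operand = concatenate_pair(0x0001, 0x0002)          # 0x00010002
        print(hex(operand), mac(accumulator=10, a=3, b=4))  # 0x10002 22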
-
Publication No.: US10656943B2
Publication Date: 2020-05-19
Application No.: US15024095
Filing Date: 2014-09-17
Inventor: David Van Kampen
Abstract: According to an aspect, a digital signal processor obtains a program instruction and, depending on an instruction type, selects a first real valued input or a second real valued input as a given real valued input (the first and second real valued inputs organized as adjacent elements of a first input vector). The processor performs an arithmetic operation on the selected real valued input to provide a real valued result, and provides a first real valued output and a second real valued output (organized as adjacent elements of a second output vector) during a first operation cycle. The real valued result is provided as the first real valued output and as the second real valued output, depending on the instruction type, and the second output vector is a real valued second output vector for real-complex multiplication with a complex valued third vector.
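A worked example of why the real result is duplicated into adjacent outputs, assuming the complex vector is stored interleaved as (re0, im0, re1, im1, ...); the function names are illustrative.

    # Duplicating a real-valued result into two adjacent output elements lets an
    # element-wise multiply with an interleaved complex vector scale both the real
    # and imaginary parts of each complex element by the same factor.
    def duplicate_real(results):
        out = []
        for r in results:
            out.extend([r, r])          # first and second real valued outputs
        return out

    def elementwise_multiply(a, b):
        return [x * y for x, y in zip(a, b)]

    if __name__ == "__main__":
        scale = duplicate_real([2.0, 0.5])               # [2.0, 2.0, 0.5, 0.5]
        complex_vec = [1.0, 3.0, 4.0, -2.0]              # (1+3j), (4-2j) interleaved
        print(elementwise_multiply(scale, complex_vec))  # [2.0, 6.0, 2.0, -1.0]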
-
Publication No.: US10592241B2
Publication Date: 2020-03-17
Application No.: US16171291
Filing Date: 2018-10-25
Inventors: Xiao Zhang, Shaoli Liu, Tianshi Chen, Yunji Chen
IPC Classification: G06F9/302, G06F15/76, G06N3/08, G06F9/30, G06F17/16, G06F9/38, G06F7/78, G06F15/80, G06N3/04, G06F9/355
Abstract: Aspects for matrix multiplication in a neural network are described herein. The aspects may include a master computation module configured to receive a first matrix and transmit a row vector of the first matrix. In addition, the aspects may include one or more slave computation modules respectively configured to store a column vector of a second matrix, receive the row vector of the first matrix, and multiply the row vector of the first matrix with the stored column vector of the second matrix to generate a result element. Further, the aspects may include an interconnection unit configured to combine the one or more result elements generated respectively by the one or more slave computation modules to generate a row vector of a result matrix and transmit the row vector of the result matrix to the master computation module.
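A compact sketch of this master/slave scheme: the master broadcasts one row, each "slave" holds one column and produces a dot-product element, and the results are combined into a result row. The function names are illustrative, not the patent's terminology.

    def slave_dot(row, column):
        # Each slave multiplies the broadcast row with its stored column.
        return sum(a * b for a, b in zip(row, column))

    def multiply(first, second):
        columns = list(zip(*second))                 # one column per slave module
        result = []
        for row in first:                            # master transmits rows one at a time
            result.append([slave_dot(row, col) for col in columns])  # interconnect combines
        return result

    if __name__ == "__main__":
        print(multiply([[1, 2], [3, 4]], [[5, 6], [7, 8]]))  # [[19, 22], [43, 50]]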
-
Publication No.: US10585973B2
Publication Date: 2020-03-10
Application No.: US16172592
Filing Date: 2018-10-26
Inventors: Jinhua Tao, Tian Zhi, Shaoli Liu, Tianshi Chen, Yunji Chen
IPC Classification: G06F9/302, G06F15/76, G06F7/50, G06N3/04, G06F17/16, G06F9/22, G06F9/30, G06F9/38, G06N3/063, G06F9/34, G06F15/80, G06F7/507
Abstract: Aspects for vector operations in a neural network are described herein. The aspects may include a vector caching unit configured to store a first vector and a second vector, wherein the first vector includes one or more first elements and the second vector includes one or more second elements. The aspects may further include one or more adders and a combiner. The one or more adders may be configured to respectively add each of the first elements to a corresponding one of the second elements to generate one or more addition results. The combiner may be configured to combine the one or more addition results into an output vector.
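A minimal sketch mirroring the adder/combiner structure in this abstract; the function names are illustrative.

    def add_elements(first_vector, second_vector):
        # One "adder" per element pair produces the addition results.
        return [a + b for a, b in zip(first_vector, second_vector)]

    def combine(addition_results):
        # The combiner assembles the per-element results into the output vector.
        return list(addition_results)

    if __name__ == "__main__":
        print(combine(add_elements([1, 2, 3], [10, 20, 30])))  # [11, 22, 33]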
-
Publication No.: US10241801B2
Publication Date: 2019-03-26
Application No.: US15390194
Filing Date: 2016-12-23
Applicant: Intel Corporation
Inventors: Jayesh Iyer, Sergey P. Scherbinin, Alexander Y. Ostanevich, Dmitry M. Maslennikov, Denis G. Motin, Alexander V. Ermolovich, Andrey Chudnovets, Sergey A. Rozhkov, Boris A. Babayan
Abstract: An apparatus includes a register file and a binary translator to create a plurality of strands and a plurality of iteration windows, where each iteration window of the plurality of iteration windows is allocated a set of contiguous registers of the register file. The apparatus further includes a buffer to store strand documentation for a strand from the plurality of strands, where the strand documentation for the strand is to include an indication of a current register base for the strand. The apparatus further includes an execution circuit to execute an instruction to update the current register base for the strand in the strand documentation for the strand based on a fixed step value and an iteration window size.
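A rough sketch of the register-base update, assuming the base advances by the fixed step and wraps within the contiguous window allocated to the iteration window; the parameter names are illustrative, not the patent's encoding.

    def update_register_base(current_base, window_start, window_size, fixed_step):
        # Advance the strand's current register base by the fixed step, wrapping
        # around inside the allocated iteration window.
        return window_start + (current_base - window_start + fixed_step) % window_size

    if __name__ == "__main__":
        base = 32                                   # window covers registers 32..47
        for _ in range(5):
            base = update_register_base(base, window_start=32, window_size=16, fixed_step=4)
            print(base)                             # 36, 40, 44, 32, 36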
-
Publication No.: US10191749B2
Publication Date: 2019-01-29
Application No.: US15301206
Filing Date: 2015-12-24
Applicant: INTEL CORPORATION
Abstract: Single Instruction, Multiple Data (SIMD) technologies are described. A processing device can include a processor core and a memory. The processor core can receive, from a software application, a request to perform an operation on a first set of variables that includes a first input value and a first register value, and to perform the operation on a second set of variables that includes a second input value and the first register value. The processor core can vectorize the operation on the first set of variables and the second set of variables. The processor core can perform the operation on the first set of variables and the second set of variables in parallel to obtain a first operation value and a second operation value. The processor core can then perform a horizontal add operation on the first operation value and the second operation value and write the result to memory.
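A plain-Python emulation of the flow described above, assuming multiplication as the vectorized operation and a dict standing in for memory; the function names are illustrative.

    def lanewise_multiply(inputs, register_values):
        # The vectorized operation applied to each (input, register) lane in parallel.
        return [x * r for x, r in zip(inputs, register_values)]

    def horizontal_add(lane_results):
        # Horizontal add across the lanes of the result vector.
        return sum(lane_results)

    if __name__ == "__main__":
        memory = {}
        lanes = lanewise_multiply([3, 5], [2, 2])   # first and second operation values
        memory["result"] = horizontal_add(lanes)    # 3*2 + 5*2
        print(memory)                               # {'result': 16}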
-