Abstract:
In a neural network unit, each neural processing unit (NPU) of an array of N NPUs receives respective first and second upper and lower bytes of 2N bytes from first and second RAMs. In a first mode, each NPU sign-extends the first upper byte to form a first 16-bit word and performs an arithmetic operation on the first 16-bit word and a second 16-bit word formed by the second upper and lower bytes. In a second mode, each NPU sign-extends the first lower byte to form a third 16-bit word and performs the arithmetic operation on the third 16-bit word and the second 16-bit word. In a third mode, each NPU performs the arithmetic operation on a fourth 16-bit word formed by the first upper and lower bytes and the second 16-bit word.
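To make the three modes concrete, here is a minimal C sketch of one NPU's operand selection, assuming a signed add as the arithmetic operation; all names (npu_alu, MODE_UPPER, and so on) are hypothetical, not from the patent:

```c
#include <stdint.h>
#include <stdio.h>

enum mode { MODE_UPPER = 1, MODE_LOWER = 2, MODE_WIDE = 3 };

/* first_hi/first_lo: the first upper and lower bytes from the first RAM;
   second_hi/second_lo: the second upper and lower bytes from the second RAM. */
static int16_t npu_alu(enum mode m,
                       uint8_t first_hi, uint8_t first_lo,
                       uint8_t second_hi, uint8_t second_lo)
{
    /* The second operand is always the 16-bit word formed by the
       second upper and lower bytes. */
    int16_t b = (int16_t)(((uint16_t)second_hi << 8) | second_lo);
    int16_t a;

    switch (m) {
    case MODE_UPPER: a = (int8_t)first_hi; break;  /* sign-extend upper byte */
    case MODE_LOWER: a = (int8_t)first_lo; break;  /* sign-extend lower byte */
    default:         a = (int16_t)(((uint16_t)first_hi << 8) | first_lo); break;
    }
    return (int16_t)(a + b);  /* placeholder arithmetic operation */
}

int main(void)
{
    /* first mode: sign-extend 0xFF (-1), add to 0x0010 (16) -> prints 15 */
    printf("%d\n", npu_alu(MODE_UPPER, 0xFF, 0x01, 0x00, 0x10));
    return 0;
}
```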
Abstract:
A processor comprising a plurality of processing cores, a last level cache memory (LLC) shared by the plurality of processing cores, and a neural network unit (NNU) comprising an array of neural processing units (NPU) and a memory array. The LLC comprises a plurality of slices. To transition from a first mode, in which the memory array stores neural network weights read by the plurality of NPUs, to a second mode, in which the memory array operates as a slice of the LLC in addition to the plurality of slices, the processor write-back-invalidates the LLC and updates a hashing algorithm to include the memory array as an additional slice. To transition from the second mode to the first mode, the processor write-back-invalidates the LLC and updates the hashing algorithm to exclude the memory array from the LLC.
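A minimal C sketch of the two transitions, with hypothetical names and sizes (NUM_CORE_SLICES, slice_hash, and a simple modulo slice-selection hash standing in for the patent's hashing algorithm): each direction write-back-invalidates the LLC, then updates the hash to include or exclude the NNU memory array as an extra slice:

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define NUM_CORE_SLICES 4       /* assumed number of ordinary LLC slices */

static bool nnu_slice_enabled;  /* false: weight-RAM mode, true: LLC-slice mode */

/* Stub: flush all dirty LLC lines to memory and invalidate every line. */
static void llc_writeback_invalidate(void) { puts("write-back-invalidate LLC"); }

/* Hashing algorithm: map a physical address to a slice. With the NNU memory
   array enabled, addresses distribute over one additional slice. */
static unsigned slice_hash(uint64_t paddr)
{
    unsigned nslices = NUM_CORE_SLICES + (nnu_slice_enabled ? 1u : 0u);
    return (unsigned)((paddr >> 6) % nslices);  /* 64-byte lines assumed */
}

static void enter_cache_mode(void)   /* first mode -> second mode */
{
    llc_writeback_invalidate();
    nnu_slice_enabled = true;        /* hash now includes the memory array */
}

static void enter_weight_mode(void)  /* second mode -> first mode */
{
    llc_writeback_invalidate();
    nnu_slice_enabled = false;       /* hash now excludes the memory array */
}

int main(void)
{
    enter_cache_mode();
    printf("slice of 0x1000: %u\n", slice_hash(0x1000));
    enter_weight_mode();
    printf("slice of 0x1000: %u\n", slice_hash(0x1000));
    return 0;
}
```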
Abstract:
A neural network unit convolves an H×W×C input with F R×S×C filters to generate F Q×P outputs. N processing units (PU) each have a register receiving a respective word of an N-word row of a second memory and a multiplexed-register selectively receiving either a respective word of an N-word row of a first memory or a word rotated from an adjacent PU's multiplexed-register. H first memory rows hold input blocks of B words, each holding channels of respective 2-dimensional input row slices. R×S×C second memory rows hold filter blocks of B words, each holding P copies of a filter weight. B is the smallest factor of N greater than W. The PU blocks multiply-accumulate input blocks and filter blocks in column-channel-row order; they read a row of input blocks and rotate it around the N PUs while performing multiply-accumulate operations, so that each PU block receives each input block before another row is read.
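The rotate-while-multiply-accumulate idea can be sketched in C as follows. This is a deliberate simplification under stated assumptions: hypothetical names (pu_t, rotate_and_mac), N fixed at 8, and filter weights held constant for a whole row, whereas the patent reads new filter blocks from the second memory as the rotation proceeds:

```c
#include <stdint.h>
#include <stdio.h>

#define N 8  /* assumed number of processing units */

typedef struct { int32_t acc; int16_t mux_reg; int16_t weight_reg; } pu_t;

static pu_t pu[N];

/* One rotation step: each PU multiply-accumulates its current input word
   with its weight, then passes the word to the adjacent PU in the ring. */
static void rotate_and_mac(void)
{
    int16_t carry = pu[N - 1].mux_reg;
    for (int i = N - 1; i > 0; i--) {
        pu[i].acc += (int32_t)pu[i].mux_reg * pu[i].weight_reg;
        pu[i].mux_reg = pu[i - 1].mux_reg;  /* rotate toward higher index */
    }
    pu[0].acc += (int32_t)pu[0].mux_reg * pu[0].weight_reg;
    pu[0].mux_reg = carry;
}

/* Process one N-word input row: read it once from the first memory, then
   perform N rotate-and-MAC steps so each PU sees each input word exactly
   once before another row is read. */
static void process_input_row(const int16_t row[N], const int16_t weights[N])
{
    for (int i = 0; i < N; i++) {
        pu[i].mux_reg = row[i];
        pu[i].weight_reg = weights[i];
    }
    for (int step = 0; step < N; step++)
        rotate_and_mac();
}

int main(void)
{
    int16_t row[N]     = {1, 2, 3, 4, 5, 6, 7, 8};
    int16_t weights[N] = {1, 1, 1, 1, 1, 1, 1, 1};
    process_input_row(row, weights);
    printf("PU0 accumulator: %d\n", (int)pu[0].acc);  /* 1+2+...+8 = 36 */
    return 0;
}
```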
Abstract:
A processor including a front end, at least one load pipeline, and a memory system that further includes a programmable prefetcher for prefetching information from an external memory. The front end converts fetched program instructions into microinstructions, including load microinstructions, and dispatches microinstructions for execution. The load pipeline executes dispatched load microinstructions and provides load requests to the memory system. The programmable prefetcher includes a load monitor, a programmable prefetch engine, and a prefetch requester. The load monitor tracks the load requests. The prefetch engine is configured to be programmed by at least one prefetch program to operate as a programmed prefetcher, such that during operation of the processor the programmed prefetcher generates at least one prefetch address based on the load requests issued by the processor. The prefetch requester submits the at least one prefetch address to the memory system to prefetch information from the external memory.
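As an illustration of the load-monitor / prefetch-engine / prefetch-requester split, here is a minimal C sketch in which the "prefetch program" is a simple stride detector; the names (on_load_request, submit_prefetch) and the stride heuristic are assumptions, not the patent's programming model:

```c
#include <stdint.h>
#include <stdio.h>

/* Prefetch requester: submit a prefetch address to the memory system. */
static void submit_prefetch(uint64_t addr)
{
    printf("prefetch 0x%llx\n", (unsigned long long)addr);
}

static uint64_t last_load_addr;
static int64_t  last_stride;

/* Load monitor plus programmed prefetcher: track each load request and,
   when the same stride repeats, generate a prefetch address one stride
   ahead of the current load. */
static void on_load_request(uint64_t addr)
{
    int64_t stride = (int64_t)(addr - last_load_addr);
    if (stride != 0 && stride == last_stride)
        submit_prefetch(addr + (uint64_t)stride);
    last_stride = stride;
    last_load_addr = addr;
}

int main(void)
{
    on_load_request(0x1000);
    on_load_request(0x1040);
    on_load_request(0x1080);  /* stride 0x40 repeats -> prefetch 0x10C0 */
    return 0;
}
```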
Abstract:
A neural network unit has a first memory that holds elements of a data matrix and a second memory that holds elements of a convolution kernel. An array of neural processing units (NPU) each have a multiplexed register that receives a corresponding element of a row from the first memory and that also receives the multiplexed register output of an adjacent NPU, a register that receives a corresponding element of a row from the second memory, and an arithmetic unit that receives the outputs of the register, the multiplexed register, and an accumulator and performs a multiply-accumulate operation on them. For each sub-matrix of a plurality of sub-matrices of the data matrix, each multiplexed register selectively provides either the element from the first memory or the adjacent NPU's multiplexed register output, and the arithmetic unit performs a series of the multiply-accumulate operations to accumulate into the accumulator a convolution of the sub-matrix with the convolution kernel.
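A minimal C sketch of the per-NPU result: a series of multiply-accumulate operations that accumulates into the accumulator the convolution of one sub-matrix with the kernel. The 3×3 kernel size, the matrix width, and all names are illustrative assumptions, and the rotation between adjacent NPUs is elided:

```c
#include <stdint.h>
#include <stdio.h>

#define K 3  /* assumed convolution kernel dimension */
#define W 5  /* assumed data matrix width */

/* Convolve the sub-matrix whose top-left corner is (row, col) with the
   kernel, the way one NPU accumulates a single output element. */
static int32_t convolve_submatrix(const int16_t data[][W], int row, int col,
                                  const int16_t kernel[K][K])
{
    int32_t acc = 0;  /* the NPU accumulator */
    for (int r = 0; r < K; r++)
        for (int c = 0; c < K; c++)
            acc += (int32_t)data[row + r][col + c] * kernel[r][c];  /* MAC */
    return acc;
}

int main(void)
{
    int16_t data[W][W] = {{0}};
    int16_t kernel[K][K] = {{0, 0, 0}, {0, 1, 0}, {0, 0, 0}};  /* identity */
    data[2][2] = 7;
    printf("%d\n", convolve_submatrix(data, 1, 1, kernel));  /* prints 7 */
    return 0;
}
```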
Abstract:
A processor includes a front-end portion that issues instructions to execution units that execute the issued instructions. A hardware neural network unit (NNU) execution unit includes a first memory that holds data words associated with artificial neural networks (ANN) neuron outputs, a second memory that holds weight words associated with connections between ANN neurons, and a third memory that holds a program comprising NNU instructions that are distinct, with respect to their instruction set, from the instructions issued to the NNU by the front-end portion of the processor. The program performs ANN-associated computations on the data and weight words. A first instruction instructs the NNU to transfer NNU instructions of the program from architectural general purpose registers to the third memory. A second instruction instructs the NNU to invoke the program stored in the third memory.
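A minimal C sketch of the effect of the two instructions, with hypothetical names (nnu_write_program, nnu_invoke_program) and an arbitrary program-memory size; actual NNU instruction encodings and execution are elided:

```c
#include <stdint.h>
#include <stdio.h>

#define PROG_MEM_WORDS 64  /* assumed size of the third memory */

static uint64_t nnu_program_memory[PROG_MEM_WORDS];  /* the third memory */

/* First instruction: transfer one NNU instruction word from an
   architectural general purpose register into the program memory. */
static void nnu_write_program(unsigned slot, uint64_t gpr_value)
{
    if (slot < PROG_MEM_WORDS)
        nnu_program_memory[slot] = gpr_value;
}

/* Second instruction: invoke the program stored in the third memory.
   Execution of each NNU instruction is only stubbed out here. */
static void nnu_invoke_program(unsigned start_slot, unsigned count)
{
    for (unsigned i = 0; i < count; i++)
        printf("execute NNU instruction 0x%016llx\n",
               (unsigned long long)nnu_program_memory[start_slot + i]);
}

int main(void)
{
    nnu_write_program(0, 0x1ULL);  /* e.g., a multiply-accumulate instruction */
    nnu_write_program(1, 0x2ULL);  /* e.g., an activation-function instruction */
    nnu_invoke_program(0, 2);
    return 0;
}
```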
Abstract:
A processor has an instruction fetch unit that fetches ISA instructions from memory and execution units that perform operations on instruction operands to generate results according to the processor's ISA. A hardware neural network unit (NNU) execution unit performs computations associated with artificial neural networks (ANN). The NNU has an array of ALUs, a first memory that holds data words associated with ANN neuron outputs, and a second memory that holds weight words associated with connections between ANN neurons. Each ALU multiplies a portion of the data words by a portion of the weight words to generate products and accumulates the products in an accumulator as an accumulated value. Activation function units normalize the accumulated values to generate outputs associated with ANN neurons. The ISA includes at least one instruction that instructs the processor to write the data words and the weight words to the respective first and second memories.
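The per-neuron data path can be sketched in C as follows; the wide accumulator and the saturating stand-in for the activation function unit's normalization step are assumptions, as are all names:

```c
#include <stdint.h>
#include <stdio.h>

/* ALU: multiply data words by weight words and accumulate the products. */
static int64_t mac(const int16_t *data, const int16_t *weights, int n)
{
    int64_t acc = 0;  /* wide accumulator avoids overflow across n products */
    for (int i = 0; i < n; i++)
        acc += (int32_t)data[i] * weights[i];
    return acc;
}

/* Activation function unit: saturate the accumulated value back into the
   16-bit data-word range (a stand-in for the normalization step). */
static int16_t activate(int64_t acc)
{
    if (acc > INT16_MAX) return INT16_MAX;
    if (acc < INT16_MIN) return INT16_MIN;
    return (int16_t)acc;
}

int main(void)
{
    int16_t data[3]    = { 100, -50, 25 };  /* neuron outputs */
    int16_t weights[3] = {   2,   4,  8 };  /* connection weights */
    printf("%d\n", activate(mac(data, weights, 3)));  /* 200 - 200 + 200 = 200 */
    return 0;
}
```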
Abstract:
A processor includes an architectural register file loadable with micro-operations by architectural instructions of an architectural instruction set of the processor and an execution unit that executes instructions. The instructions are either architectural instructions or microinstructions into which architectural instructions are translated. The execution unit includes a decoder that decodes the instructions into micro-operations, a mode indicator that indicates one of first and second modes, a pipeline of stages to which are provided micro-operations that control circuits of the stages of the pipeline, and a multiplexer. The multiplexer selects for provision to the pipeline a micro-operation received from the decoder when the mode indicator indicates the first mode and selects for provision to the pipeline a micro-operation received from the architectural register file when the mode indicator indicates the second mode.
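A minimal C sketch of the multiplexer's selection, with stub sources standing in for the decoder and the architectural register file; all names and widths are hypothetical:

```c
#include <stdio.h>

typedef unsigned uop_t;  /* a micro-operation; width assumed */

enum mode { MODE_DECODER = 1, MODE_REGFILE = 2 };

static enum mode mode_indicator = MODE_DECODER;  /* the mode indicator */

/* Stub decoder: decode an instruction into a micro-operation. */
static uop_t decode(unsigned instruction) { return instruction & 0xFF; }

/* Stub read of a micro-operation previously loaded into the architectural
   register file by architectural instructions. */
static uop_t regfile_uop(void) { return 0xAB; }

/* The multiplexer: select which micro-operation the pipeline receives. */
static uop_t select_uop(unsigned instruction)
{
    return (mode_indicator == MODE_DECODER) ? decode(instruction)
                                            : regfile_uop();
}

int main(void)
{
    printf("0x%X\n", select_uop(0x1234));  /* first mode: from the decoder */
    mode_indicator = MODE_REGFILE;
    printf("0x%X\n", select_uop(0x1234));  /* second mode: from the register file */
    return 0;
}
```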
Abstract:
An apparatus including first and second reservation stations. The first reservation station dispatches a load microinstruction and indicates on a hold bus whether the load microinstruction is a specified load microinstruction directed to retrieve an operand from a prescribed resource other than on-core cache memory. The second reservation station is coupled to the hold bus and dispatches one or more younger microinstructions therein that depend on the load microinstruction for execution a number of clock cycles after dispatch of the load microinstruction. If the hold bus indicates that the load microinstruction is the specified load microinstruction, the second reservation station stalls dispatch of the one or more younger microinstructions until the load microinstruction has retrieved the operand. The prescribed resources include an input/output (I/O) unit configured to perform I/O operations via an I/O bus coupling an out-of-order processor to I/O resources.
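A minimal C sketch of the hold-bus handshake, with hypothetical names and an assumed fixed load-to-use latency; it models only the stall decision, not full reservation-station scheduling:

```c
#include <stdbool.h>
#include <stdio.h>

#define LOAD_LATENCY 4  /* assumed load-to-use latency in clock cycles */

static bool hold_bus;          /* asserted for specified (slow) loads */
static bool operand_returned;  /* set when the load's operand arrives */

/* First reservation station: dispatch a load, signaling on the hold bus
   whether it targets a resource other than on-core cache memory. */
static void dispatch_load(bool is_specified_load)
{
    hold_bus = is_specified_load;
    operand_returned = false;
}

/* Second reservation station: decide whether a dependent younger
   microinstruction may dispatch 'clocks_since_load' cycles after the load. */
static bool may_dispatch_dependent(int clocks_since_load)
{
    if (hold_bus)                              /* specified load: wait for data */
        return operand_returned;
    return clocks_since_load >= LOAD_LATENCY;  /* ordinary load: fixed delay */
}

int main(void)
{
    dispatch_load(true);                       /* e.g., a load from the I/O unit */
    printf("%d\n", may_dispatch_dependent(4)); /* 0: still stalled */
    operand_returned = true;
    printf("%d\n", may_dispatch_dependent(4)); /* 1: may dispatch now */
    return 0;
}
```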
Abstract:
A hardware data compressor. A hardware engine maintains first and second hash tables while it scans an input block of characters to be compressed. The first hash table is indexed by a hash of N characters of the input block. The second hash table is indexed by a hash of M characters of the input block. M is greater than two, and N is greater than M. The engine uses the first hash table to search the input block behind a current search target location for a match of at least N characters at the current search target location; when no such match is found using the first hash table, it uses the second hash table to search the input block behind the current search target location for a match of at least M characters.
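A minimal C sketch of the two-table search, with an assumed table size, an FNV-1a-style hash (not the patent's hash), and widths N = 5 and M = 3 chosen only to satisfy N > M > 2:

```c
#include <stdint.h>
#include <string.h>
#include <stdio.h>

#define N 5            /* assumed long-match hash width  */
#define M 3            /* assumed short-match hash width */
#define TABLE_SIZE 4096
#define NO_POS (-1)

static int table_n[TABLE_SIZE];  /* hash of N chars -> most recent position */
static int table_m[TABLE_SIZE];  /* hash of M chars -> most recent position */

static unsigned hash_chars(const uint8_t *p, int n)
{
    unsigned h = 2166136261u;  /* FNV-1a style hash */
    for (int i = 0; i < n; i++) h = (h ^ p[i]) * 16777619u;
    return h % TABLE_SIZE;
}

/* Search behind 'pos' for a match of at least N characters; fall back to
   the second table for a match of at least M characters if none is found. */
static int find_match(const uint8_t *in, int pos)
{
    int cand = table_n[hash_chars(in + pos, N)];
    if (cand != NO_POS && memcmp(in + cand, in + pos, N) == 0)
        return cand;             /* match of at least N characters */
    cand = table_m[hash_chars(in + pos, M)];
    if (cand != NO_POS && memcmp(in + cand, in + pos, M) == 0)
        return cand;             /* match of at least M characters */
    return NO_POS;
}

/* Maintain both tables as the engine scans the input block. */
static void insert_pos(const uint8_t *in, int pos)
{
    table_n[hash_chars(in + pos, N)] = pos;
    table_m[hash_chars(in + pos, M)] = pos;
}

int main(void)
{
    const uint8_t *in = (const uint8_t *)"abcdefXXabcdef";
    memset(table_n, 0xFF, sizeof table_n);  /* all entries become NO_POS */
    memset(table_m, 0xFF, sizeof table_m);
    for (int i = 0; i + N <= 8; i++) insert_pos(in, i);
    printf("%d\n", find_match(in, 8));      /* finds the match at position 0 */
    return 0;
}
```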