Abstract:
Broadly speaking, embodiments of the present techniques provide a reconfigurable hardware-based artificial neural network, wherein weights for each neural network node of the artificial neural network are obtained via training performed external to the neural network.
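A rough behavioural sketch of the arrangement just described may help: weights are produced by training outside the device and then programmed into a reconfigurable array of hardware nodes. This is illustrative only, not the patented design, and every class and function name here is invented:

    def train_externally(samples):
        """Stand-in for offline training; returns one weight vector per node."""
        return [[0.5, -0.25], [0.125, 1.0]]      # hypothetical trained weights

    class HardwareNode:
        def __init__(self):
            self.weights = None                  # programmed, not learned on-chip

        def fire(self, inputs):
            return sum(w * x for w, x in zip(self.weights, inputs))

    class ReconfigurableANN:
        def __init__(self, node_count):
            self.nodes = [HardwareNode() for _ in range(node_count)]

        def load_weights(self, weight_sets):
            """Reconfigure the array with externally trained weights."""
            for node, weights in zip(self.nodes, weight_sets):
                node.weights = weights

    ann = ReconfigurableANN(node_count=2)
    ann.load_weights(train_externally(samples=None))
    print([node.fire([1.0, 2.0]) for node in ann.nodes])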
Abstract:
The present disclosure relates to a data processor for processing data, comprising: a plurality of execution units to execute one or more operations; and a plurality of storage elements to store data for the one or more operations, the data processor being configured to process at least one task, each task to be executed in the form of a directed acyclic graph of operations, wherein each of the operations maps to a corresponding execution unit and each connection between operations in the directed acyclic graph maps to a corresponding storage element, the data processor further comprising: a plurality of counters; and a control module to control the plurality of counters to: in a first mode, count an operation cycle number associated with each operation of the at least one task, the operation cycle number of an operation being the number of cycles required to complete the operation; and in a second mode, count a unit cycle number associated with one or more execution units, the unit cycle number of an execution unit being the cumulative number of cycles during which the execution unit is occupied during execution of the at least one task.
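The two counter modes can be pictured with a small software model. The abstract describes hardware counters; this is a sketch under assumed names (the mode constants, operation and unit labels, and cycle figures are all invented for illustration):

    OP_CYCLE_MODE, UNIT_CYCLE_MODE = 1, 2

    class CounterControl:
        def __init__(self, mode):
            self.mode = mode
            self.op_cycles = {}     # first mode: cycles to complete each operation
            self.unit_cycles = {}   # second mode: cumulative busy cycles per unit

        def record(self, op_name, unit_name, cycles):
            if self.mode == OP_CYCLE_MODE:
                self.op_cycles[op_name] = cycles
            else:
                self.unit_cycles[unit_name] = self.unit_cycles.get(unit_name, 0) + cycles

    # Task as a list of (operation, execution unit, cycles taken in this run);
    # the DAG edges (storage elements) are omitted for brevity.
    task = [("load", "lsu", 4), ("mul", "alu0", 3), ("add", "alu0", 1), ("store", "lsu", 4)]

    ctrl = CounterControl(UNIT_CYCLE_MODE)
    for op, unit, cycles in task:
        ctrl.record(op, unit, cycles)
    print(ctrl.unit_cycles)         # {'lsu': 8, 'alu0': 4}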
Abstract:
A processor to obtain mapping data indicative of at least one mapping parameter for a plurality of mapping blocks of a multi-dimensional tensor to be mapped. The at least one mapping parameter is for mapping corresponding elements of each mapping block to the same co-ordinate in at least one selected dimension of the multi-dimensional tensor, such that each mapping block corresponds to the same set of co-ordinates in the at least one selected dimension. A co-ordinate of an element of a block of the multi-dimensional tensor is determined, the element being comprised in a mapping block. A physical address in a storage is then determined based on the co-ordinate. The physical address is utilized in a process comprising an interaction between the block of the multi-dimensional tensor and the storage.
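One way to picture this: if the mapping parameter is taken to be the block extent in the selected dimension (an assumption made here for illustration, not stated in the abstract), folding the selected dimension with a modulo makes corresponding elements of every mapping block land on the same co-ordinate, and hence the same physical address. All names below are hypothetical:

    def mapped_coordinate(coord, selected_dim, block_extent):
        """Fold the selected dimension so every mapping block shares co-ordinates."""
        folded = list(coord)
        folded[selected_dim] %= block_extent
        return tuple(folded)

    def physical_address(coord, strides, base=0):
        """Row-major style address generation from a (folded) co-ordinate."""
        return base + sum(c * s for c, s in zip(coord, strides))

    # 3-D tensor; dimension 0 is the selected dimension, mapping blocks of extent 4.
    strides = (64, 8, 1)
    for d0 in (1, 5, 9):            # the same element of three mapping blocks
        coord = mapped_coordinate((d0, 2, 3), selected_dim=0, block_extent=4)
        print(coord, physical_address(coord, strides))   # identical address each time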
Abstract:
A single instruction multiple thread (SIMT) processor includes execution circuitry, prefetch circuitry and prefetch strategy selection circuitry. The prefetch strategy selection circuitry serves to detect one or more characteristics of a stream of program instructions being executed in order to identify whether or not a given data access instruction within a program will be executed a plurality of times. The prefetch strategy to use is selected from a plurality of selectable prefetch strategies in dependence upon such detected characteristics.
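A toy model of such a selector, with an invented set of strategies and a simple stride detector standing in for whatever characteristics the hardware actually monitors:

    from collections import Counter

    executions = Counter()
    last_addr, last_delta, stride_hits = {}, {}, Counter()

    def observe(pc, addr):
        """Toy stream monitor: count executions and consistent address strides."""
        executions[pc] += 1
        if pc in last_addr:
            delta = addr - last_addr[pc]
            if last_delta.get(pc) == delta:
                stride_hits[pc] += 1
            last_delta[pc] = delta
        last_addr[pc] = addr

    def select_strategy(pc):
        if executions[pc] < 2:
            return "no_prefetch"    # not seen to be re-executed: do not prefetch
        if stride_hits[pc] > 0:
            return "strided"        # re-executed with a regular address stride
        return "next_line"

    for addr in (0x1000, 0x1040, 0x1080, 0x10C0):
        observe(pc=0x40, addr=addr)
    print(select_strategy(0x40))    # 'strided'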
Abstract:
A data processing system includes: a processor; a data interface for communication with a control unit, the processor being on one side of the data interface; internal storage accessible by the processor, the internal storage being on the same side of the data interface as the processor; and a register array accessible by the processor and comprising a plurality of registers, each register having a plurality of vector lanes. The internal storage is arranged to store control data indicating an ordered selection of vector lanes of one or more of the registers. The processor is arranged to, in response to receiving instruction data from the control unit, perform a swizzle operation in which data is selected from one or more source registers in the register array and transferred to a destination register. The data is selected from vector lanes in accordance with the control data stored in the internal storage.
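A minimal sketch of the swizzle itself, assuming hypothetical register names and four lanes per register; the control data is modelled as an ordered list of (source register, lane) pairs, one per destination lane:

    registers = {
        "v0": [10, 11, 12, 13],     # four vector lanes per register
        "v1": [20, 21, 22, 23],
    }

    # Control data: ordered (source register, lane index) per destination lane.
    control = [("v1", 3), ("v0", 0), ("v1", 1), ("v0", 2)]

    def swizzle(reg_file, ctrl, dest):
        """Select lanes from source registers per the stored control data."""
        reg_file[dest] = [reg_file[src][lane] for src, lane in ctrl]

    swizzle(registers, control, dest="v2")
    print(registers["v2"])          # [23, 10, 21, 12]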
Abstract:
A neural network processor is disclosed that includes a combined convolution and pooling circuit that can perform both convolution and pooling operations. The circuit can perform a convolution operation by a multiply circuit determining products of corresponding input feature map and convolution kernel weight values, and an add circuit accumulating the products determined by the multiply circuit in storage. The circuit can perform an average pooling operation by the add circuit accumulating input feature map data values in the storage, a divisor circuit determining a divisor value, and a division circuit dividing the data value accumulated in the storage by the determined divisor value. The circuit can perform a maximum pooling operation by a maximum circuit determining a maximum value of input feature map data values, and storing the determined maximum value in the storage.
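The three operating modes of the combined circuit can be expressed functionally. This is a behavioural sketch over 1-D windows (the abstract concerns feature-map hardware; the function names are invented here):

    def convolve(window, kernel):
        acc = 0
        for x, w in zip(window, kernel):
            acc += x * w                # multiply circuit feeds the add circuit
        return acc

    def average_pool(window):
        acc = 0
        for x in window:
            acc += x                    # add circuit alone accumulates data values
        return acc / len(window)        # divisor circuit supplies the divisor

    def max_pool(window):
        best = window[0]
        for x in window[1:]:
            best = max(best, x)         # maximum circuit updates the stored value
        return best

    window = [1.0, 4.0, 2.0, 3.0]
    print(convolve(window, [0.25, 0.25, 0.25, 0.25]))   # 2.5
    print(average_pool(window), max_pool(window))       # 2.5 4.0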
Abstract:
An apparatus and method are provided for executing a plurality of threads. The apparatus has processing circuitry arranged to execute the plurality of threads, with each thread executing a program to perform processing operations on thread data. Each thread has a thread identifier, and the thread data includes a value which is dependent on the thread identifier. Value generator circuitry is provided to perform a computation using the thread identifier of a chosen thread in order to generate the above-mentioned value for that chosen thread, and to make that value available to the processing circuitry for use by the processing circuitry when executing the chosen thread. Such an arrangement can give rise to significant performance benefits when executing the plurality of threads on the apparatus.
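As a sketch of the idea, the value generator below computes a per-thread value from the thread identifier (a simple base + tid * stride here, which is an assumed computation, not one given in the abstract) and hands it to the thread before its program body runs:

    def value_generator(tid, base=0x1000, stride=16):
        """Compute the thread-identifier-dependent value for a chosen thread."""
        return base + tid * stride

    def run_thread(tid, program):
        value = value_generator(tid)    # made available to the processing circuitry
        return program(tid, value)

    def program(tid, value):
        return f"thread {tid} uses address {value:#x}"

    print([run_thread(tid, program) for tid in range(4)])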
Abstract:
A data processing apparatus 2 has processing circuitry 4 which can process multiple parallel threads of processing. A shared instruction decoder 30 decodes program instructions to generate micro-operations to be processed by the processing circuitry 4. The instructions include at least one complex instruction which has multiple micro-operations. Multiple fetch units 8 are provided for fetching the micro-operations generated by the decoder 30 for processing by the processing circuitry 4. Each fetch unit 8 is associated with at least one of the threads. The decoder 30 generates the micro-operations of a complex instruction individually in response to separate decode requests 24 triggered by a fetch unit 8, each decode request 24 identifying which micro-operation of the complex instruction is to be generated by the decoder 30 in response to the decode request 24.
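A simplified model of the per-micro-operation decode requests, with an invented micro-operation table standing in for the real decoder; the point is that the shared decoder emits exactly one micro-operation per request, identified by index:

    MICRO_OPS = {
        "load_multiple": ["uop_addr_gen", "uop_load_r0", "uop_load_r1"],
        "add":           ["uop_add"],
    }

    def decode_request(instruction, uop_index):
        """Shared decoder: generate only the requested micro-operation."""
        return MICRO_OPS[instruction][uop_index]

    class FetchUnit:
        """One fetch unit per thread; issues decode requests individually."""
        def __init__(self, thread_id):
            self.thread_id = thread_id

        def fetch_all(self, instruction):
            count = len(MICRO_OPS[instruction])
            return [decode_request(instruction, i) for i in range(count)]

    print(FetchUnit(thread_id=0).fetch_all("load_multiple"))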
Abstract:
A data processing device 100 comprises a plurality of storage circuits 130, 160, which store a plurality of data elements of b bits in an interleaved manner. The data processing device also comprises a consumer 110 with a number of lanes 120. The consumer is able to individually access each of the plurality of storage circuits 130, 160 in order to receive into the lanes 120 either a subset of the plurality of data elements or y bits of each of the plurality of data elements. The consumer 110 is also able to execute a common instruction on each of the plurality of lanes 120. The relationship between b and y is such that b is greater than y and is an integer multiple of y. Each of the plurality of storage circuits 130, 160 stores at most y bits of each of the data elements, and hence at most a fraction y/b of the bits of the plurality of data elements. By carrying out the interleaving in this manner, the plurality of storage circuits 130, 160 comprise no more than b/y storage circuits.
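A bit-level sketch with illustrative parameters b=32 and y=8 (chosen here, not given in the abstract): each b-bit element is split into b/y chunks of y bits, with chunk i of every element held in storage circuit i. The consumer can then read either a whole element (one chunk from every circuit) or y bits of every element (all chunks from a single circuit):

    B, Y = 32, 8
    N_CIRCUITS = B // Y                 # no more than b/y storage circuits

    def interleave(elements):
        circuits = [[] for _ in range(N_CIRCUITS)]
        for value in elements:
            for i in range(N_CIRCUITS):     # circuit i holds bits [i*y, (i+1)*y)
                circuits[i].append((value >> (i * Y)) & ((1 << Y) - 1))
        return circuits

    def read_element(circuits, index):
        """Reassemble one full b-bit element from all circuits."""
        return sum(circuits[i][index] << (i * Y) for i in range(N_CIRCUITS))

    circuits = interleave([0x11223344, 0xAABBCCDD])
    print([hex(c[0]) for c in circuits])    # y-bit slices of element 0
    print(hex(read_element(circuits, 1)))   # 0xaabbccdd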