Abstract:
Techniques for performing matrix multiplication in a data processing apparatus are disclosed, comprising apparatuses, matrix multiply instructions, methods of operating the apparatuses, and virtual machine implementations. Registers, each register for storing at least four data elements, are referenced by a matrix multiply instruction and in response to the matrix multiply instruction a matrix multiply operation is carried out. First and second matrices of data elements are extracted from first and second source registers, and plural dot product operations, acting on respective rows of the first matrix and respective columns of the second matrix are performed to generate a square matrix of result data elements, which is applied to a destination register. A higher computation density for a given number of register operands is achieved with respect to vector-by-element techniques.
Abstract:
A single instruction multiple thread (SIMT) processor 2 includes execution circuitry 6, prefetch circuitry 12 and prefetch strategy selection circuitry 14. The prefetch strategy selection circuitry serves to detect one or more characteristics of a stream of program instructions that are being executed to identify whether or not a given data access instruction within a program will be executed a plurality of times. The prefetch strategy to use is selected from a plurality of selectable prefetch strategy in dependence upon the detection of such characteristics.
Abstract:
A processor to: receive a task to be executed, the task comprising a task-based parameter associated with the task, for use in determining a position, within an array of data descriptors, of a particular data descriptor of a particular portion of data to be processed in executing the task. Each of the data descriptors in the array of data descriptors has a predetermined size and is indicative of a location in a storage system of a respective portion of data. The processor derives, based on the task, array location data indicative of a location in the storage system of a predetermined data descriptor, and obtains the particular data descriptor, based on the array location data and the task-based parameter. The processor obtains the particular portion of data based on the particular data descriptor and processes the particular portion of data in executing the task.
Abstract:
An apparatus and method are provided for executing a plurality of threads. The apparatus has processing circuitry arranged to execute the plurality of threads, with each thread executing a program to perform processing operations on thread data. Each thread has a thread identifier, and the thread data includes a value which is dependent on the thread identifier. Value generator circuitry is provided to perform a computation using the thread identifier of a chosen thread in order to generate the above mentioned value for that chosen thread, and to make that value available to the processing circuitry for use by the processing circuitry when executing the chosen thread. Such an arrangement can give rise to significant performance benefits when executing the plurality of threads on the apparatus.
Abstract:
A data processing apparatus and method of data processing are disclosed. An instruction execution unit executes a sequence of program instructions, wherein execution of at least some of the program instructions initiates memory access requests to retrieve data values from a memory. A prefetch unit prefetches data values from the memory for storage in a cache unit before they are requested by the instruction execution unit. The prefetch unit is configured to perform a miss response comprising increasing a number of the future data values which it prefetches, when a memory access request specifies a pending data value which is already subject to prefetching but is not yet stored in the cache unit. The prefetch unit is also configured, in response to an inhibition condition being met, to temporarily inhibit the miss response for an inhibition period.
Abstract:
A data processing device includes processing circuitry 20 for executing a first memory access instruction to a first address of a memory device 40 and a second memory access instruction to a second address of the memory device 40, the first address being different from the second address. The data processing device also includes prefetching circuitry 30 for prefetching data from the memory device 40 based on a stride length 70 and instruction analysis circuitry 50 for determining a difference between the first address and the second address. Stride refining circuitry 60 is also provided to refine the stride length based on factors of the stride length and factors of the difference calculated by the instruction analysis circuitry 50.
Abstract:
A data processing apparatus 2 has processing circuitry 4 which can process multiple parallel threads of processing. A shared instruction decoder 30 decodes program instructions to generate micro-operations to be processed by the processing circuitry 4. The instructions include at least one complex instruction which has multiple micro-operations. Multiple fetch units 8 are provided for fetching the micro-operations generated by the decoder 30 for processing by the processing circuitry 4. Each fetch unit 8 is associated with at least one of the threads. The decoder 30 generates the micro-operations of a complex instruction individually in response to separate decode requests 24 triggered by a fetch unit 8, each decode request 24 identifying which micro-operation of the complex instruction is to be generated by the decoder 30 in response to the decode request 24.