Abstract:
An apparatus comprises matrix processing circuitry to perform a matrix processing operation on first and second input operands to generate a result matrix, where the result matrix is a two-dimensional matrix; operand storage circuitry to store information for forming the first and second input operands for the matrix processing circuitry; and masking circuitry to perform a masking operation to mask at least part of the matrix processing operation or the information stored to the operand storage circuitry based on masking state data indicative of one or more masked row or column positions to be treated as representing a masking value. This is useful for improving performance of two-dimensional convolution operations, as the masking can be used to mask out selected rows or columns when performing the 2D convolution as a series of 1×1 convolution operations applied to different kernel positions.
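To illustrate the last point, the following is a minimal software sketch, not the claimed circuitry, of a 2D convolution computed as one 1×1 convolution per kernel position. The function names, the row-major pixel layout, and the choice of zero as the masking value are assumptions made for this example. For each kernel offset, the masked row and column positions are those whose shifted source would fall outside the image.

```python
def masked_positions(size, offset):
    """Positions whose shifted source index falls outside [0, size)."""
    return {p for p in range(size) if not 0 <= p + offset < size}


def conv2d_via_1x1(image, height, width, kernel):
    """2D convolution as one 1x1 convolution per kernel position.

    image:  height*width rows (one per pixel, row-major), each holding C_in values.
    kernel: maps a (dy, dx) kernel offset to a C_in x C_out weight matrix.
    """
    c_out = len(next(iter(kernel.values()))[0])
    out = [[0] * c_out for _ in range(height * width)]
    for (dy, dx), weights in kernel.items():
        # Masking state for this kernel position (hypothetical representation).
        masked_rows = masked_positions(height, dy)
        masked_cols = masked_positions(width, dx)
        for y in range(height):
            for x in range(width):
                if y in masked_rows or x in masked_cols:
                    continue  # masked input is treated as the masking value 0
                src = image[(y + dy) * width + (x + dx)]
                dst = out[y * width + x]
                # 1x1 convolution: multiply the pixel's channels by the weights.
                for ci, value in enumerate(src):
                    for co in range(c_out):
                        dst[co] += value * weights[ci][co]
    return out


image = [[v] for v in range(1, 10)]  # 3x3 image with values 1..9, one channel
kernel = {(dy, dx): [[1]] for dy in (-1, 0, 1) for dx in (-1, 0, 1)}
result = conv2d_via_1x1(image, 3, 3, kernel)
```

Without the column mask, the pixel at the start of the second image row would wrongly pick up the last pixel of the first row when the kernel offset is `dx = -1`, because the two are adjacent in the flattened layout.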
Abstract:
An apparatus comprises an instruction decoder and processing circuitry. In response to a data structure processing instruction specifying at least one input data structure identifier and an output data structure identifier, the instruction decoder controls the processing circuitry to perform a processing operation on at least one input data structure to generate an output data structure. Each input or output data structure comprises an arrangement of data corresponding to a plurality of memory addresses. The apparatus comprises two or more sets of one or more data structure metadata registers, each set associated with a corresponding data structure identifier and designated to store address-indicating metadata for identifying the memory addresses for the data structure identified by the corresponding data structure identifier.
Abstract:
A data processing apparatus and method are provided for executing a plurality of threads. Processing circuitry performs processing operations required by the plurality of threads, the processing operations including a lock-protected processing operation with which a lock is associated, where the lock needs to be acquired before the processing circuitry performs the lock-protected processing operation. Baton maintenance circuitry is used to maintain a baton in association with the plurality of threads, the baton forming a proxy for the lock, and the baton maintenance circuitry being configured to allocate the baton between the threads. Via communication between the processing circuitry and the baton maintenance circuitry, once the lock has been acquired for one of the threads, the processing circuitry performs the lock-protected processing operation for multiple threads before the lock is released, with the baton maintenance circuitry identifying a current thread amongst the multiple threads for which the lock-protected processing operation is to be performed by allocating the baton to that current thread. The baton can hence be passed from one thread to the next, without needing to release and re-acquire the lock. This provides a significant performance improvement when performing lock-protected processing operations across multiple threads.
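The baton hand-off can be sketched in software as follows. This is a single-threaded simulation under stated assumptions, not the circuitry described: the `Lock`, `BatonTable`, and `run_lock_protected` names are hypothetical, and waiting threads are assumed to be served in request order. It shows the lock being acquired once while the baton moves from thread to thread.

```python
from collections import deque


class Lock:
    """A lock that counts how many times it has been acquired."""

    def __init__(self):
        self.held = False
        self.acquisitions = 0

    def acquire(self):
        assert not self.held, "lock is already held"
        self.held = True
        self.acquisitions += 1

    def release(self):
        self.held = False


class BatonTable:
    """Hypothetical stand-in for the baton maintenance circuitry."""

    def __init__(self):
        self.holder = None
        self.waiting = deque()

    def request(self, thread_id):
        # The first requester receives the baton; later ones wait in order.
        if self.holder is None:
            self.holder = thread_id
        else:
            self.waiting.append(thread_id)

    def pass_on(self):
        # Allocate the baton to the next waiting thread, if any.
        self.holder = self.waiting.popleft() if self.waiting else None
        return self.holder


def run_lock_protected(thread_ids):
    lock, baton, order = Lock(), BatonTable(), []
    for thread_id in thread_ids:
        baton.request(thread_id)
    lock.acquire()  # acquired once, on behalf of the first baton holder
    current = baton.holder
    while current is not None:
        order.append(current)  # the lock-protected operation for this thread
        current = baton.pass_on()
    lock.release()  # released only after the last thread has finished
    return order, lock.acquisitions


order, acquisitions = run_lock_protected([3, 1, 2])
```

Here three threads complete their lock-protected operation with a single acquire and release, instead of one acquire and release per thread.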
Abstract:
Instruction decoder to decode processing instructions; one or more first registers; first processing circuitry to execute the decoded processing instructions in a first processing mode, configured to execute the decoded processing instructions using the one or more first registers; and control circuitry to execute the decoded processing instructions in a second processing mode using one or more second registers. The instruction decoder is configured to decode processing instructions selected from a first instruction set and a second instruction set in the second processing mode, in which one or both of the first and second instruction sets comprise at least one unique instruction. The instruction decoder is configured to decode one or more mode change instructions to change between the first and second processing modes, and the first processing circuitry is configured to change the current processing mode between the first and second processing modes in response to executing a mode change instruction.
Abstract:
A data processing apparatus is provided comprising a plurality of storage circuits to store data. Execution circuitry performs one or more operations using the storage circuits in response to instructions. The instructions include a relinquish instruction. The execution circuitry responds to the relinquish instruction by indicating that at least one of the plurality of storage circuits is an unused storage circuit, and, after executing the relinquish instruction, the execution circuitry affects execution of future instructions based on the unused storage circuit.
Abstract:
A data processing apparatus (100) executes threads and includes a general program counter (PC) (120) identifying an instruction to be executed for at least a subset of the threads. Each thread has a thread PC (184). The subset of threads has at least one lock parameter (188, 500-504) for tracking exclusive access to a shared resource. In response to a first instruction executed for a thread, the processor (160) modifies the at least one lock parameter (188, 500-504) to indicate that the thread has gained exclusive access to the shared resource. In response to a second instruction, the processor modifies the at least one lock parameter (188, 500-504) to indicate that the thread no longer has exclusive access. A selector (110) selects one of the subset of threads based on the at least one lock parameter (188, 500-504) and sets the general PC (120) to the thread PC (184) of the selected thread.
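A minimal sketch of such a selector follows. The abstract does not state the selection policy, so this example assumes one: a thread that holds exclusive access is selected first, and otherwise the thread with the lowest thread PC is chosen. The `ThreadGroup` class and the representation of the lock parameter as a single owner field are hypothetical.

```python
class ThreadGroup:
    """Hypothetical model of a subset of threads sharing a general PC."""

    def __init__(self, thread_pcs):
        self.thread_pcs = dict(thread_pcs)  # thread id -> thread PC
        self.lock_owner = None              # assumed form of the lock parameter
        self.general_pc = None

    def acquire(self, thread_id):
        # "First instruction": the thread gains exclusive access.
        self.lock_owner = thread_id

    def release(self, thread_id):
        # "Second instruction": the thread gives up exclusive access.
        if self.lock_owner == thread_id:
            self.lock_owner = None

    def select(self):
        # Assumed policy: prefer the lock holder, else the lowest thread PC.
        if self.lock_owner is not None:
            selected = self.lock_owner
        else:
            selected = min(self.thread_pcs, key=self.thread_pcs.get)
        self.general_pc = self.thread_pcs[selected]
        return selected


group = ThreadGroup({0: 40, 1: 12, 2: 28})
before_lock = (group.select(), group.general_pc)
group.acquire(2)
while_locked = (group.select(), group.general_pc)
group.release(2)
after_release = (group.select(), group.general_pc)
```

Preferring the lock holder lets that thread reach its release instruction, so other threads in the subset are not left waiting on a thread that is never scheduled.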
Abstract:
Techniques for performing matrix multiplication in a data processing apparatus are disclosed, comprising apparatuses, matrix multiply instructions, methods of operating the apparatuses, and virtual machine implementations. Registers, each register for storing at least four data elements, are referenced by a matrix multiply instruction, and in response to the matrix multiply instruction a matrix multiply operation is carried out. First and second matrices of data elements are extracted from first and second source registers, and plural dot product operations, acting on respective rows of the first matrix and respective columns of the second matrix, are performed to generate a square matrix of result data elements, which is applied to a destination register. A higher computation density for a given number of register operands is achieved with respect to vector-by-element techniques.
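For the smallest case, with four data elements per register, the operation can be sketched as below. The row-major 2×2 layout and the `matmul_2x2` name are assumptions for this example, and the result simply overwrites the destination here, whereas an implementation might instead accumulate into it.

```python
def matmul_2x2(src1, src2):
    """Treat each 4-element register as a row-major 2x2 matrix and multiply."""
    a = [src1[0:2], src1[2:4]]  # rows of the first matrix
    b = [src2[0:2], src2[2:4]]  # rows of the second matrix
    # One dot product per (row of a, column of b) pair, stored row-major.
    return [
        sum(a[i][k] * b[k][j] for k in range(2))
        for i in range(2)
        for j in range(2)
    ]


dest = matmul_2x2([1, 2, 3, 4], [5, 6, 7, 8])
```

Two 4-element source registers yield eight multiplications in this sketch, compared with four when a 4-element vector is multiplied by a single element.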
Abstract:
A single instruction multiple thread (SIMT) processor 2 includes execution circuitry 6, prefetch circuitry 12 and prefetch strategy selection circuitry 14. The prefetch strategy selection circuitry serves to detect one or more characteristics of a stream of program instructions that are being executed to identify whether or not a given data access instruction within a program will be executed a plurality of times. The prefetch strategy to use is selected from a plurality of selectable prefetch strategies in dependence upon the detection of such characteristics.
Abstract:
A single instruction multiple thread (SIMT) processor 2 includes scheduling circuitry 8 for calculating a next scheduled execution point for execution circuits 4 which execute respective threads corresponding to a common program. In addition to calculating the next scheduled execution point, the scheduling circuitry determines a runner-up execution point, which would have been determined as the next scheduled execution point if the threads which actually correspond to the next scheduled execution point had been removed from consideration. This runner-up execution point is used to identify points of re-convergence within the program flow and as part of the operation of a static branch predictor 10.
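The relationship between the two execution points can be sketched as follows. The abstract does not say how the next scheduled execution point is chosen, so this example assumes a lowest-program-counter policy; the `schedule` function is hypothetical.

```python
def schedule(thread_pcs):
    """Return (next scheduled execution point, runner-up execution point).

    Assumes the scheduler picks the lowest program counter among the threads.
    """
    next_point = min(thread_pcs)
    # Remove the threads at the winning point and schedule again.
    remaining = [pc for pc in thread_pcs if pc != next_point]
    runner_up = min(remaining) if remaining else None
    return next_point, runner_up


diverged = schedule([8, 4, 4, 12])
converged = schedule([4, 4])
```

Under this assumption, the threads running at the next scheduled execution point rejoin other threads once they advance to the runner-up point, which is why that point is a natural candidate for re-convergence.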
Abstract:
An apparatus and method are provided for performing register renaming. Available register identifying circuitry is provided to identify which physical registers form a pool of physical registers available to be mapped by register renaming circuitry to an architectural register specified by an instruction to be executed. Configuration data, whose value is modified during operation of the processing circuitry, is stored such that, when the configuration data has a first value, the configuration data identifies at least one architectural register of the architectural register set which does not require mapping to a physical register by the register renaming circuitry. The available register identifying circuitry is arranged to reference the configuration data, such that when the configuration data has the first value, the number of physical registers in the pool is increased due to the reduction in the number of architectural registers which require mapping to physical registers.
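The effect on the pool size can be shown with a deliberately simplified count. This sketch assumes that every architectural register requiring a mapping occupies exactly one physical register and ignores in-flight renames; the function name, parameters, and register counts are all hypothetical.

```python
def available_pool_size(num_physical, arch_regs, exempt_regs, config_has_first_value):
    """Count physical registers left for renaming (simplified model).

    exempt_regs are the architectural registers that need no mapping when the
    configuration data has its first value.
    """
    needs_mapping = [
        reg for reg in arch_regs
        if not (config_has_first_value and reg in exempt_regs)
    ]
    # Assumption: one physical register is reserved per mapped architectural register.
    return num_physical - len(needs_mapping)


arch_regs = range(32)         # hypothetical 32 architectural registers
exempt_regs = set(range(16, 32))  # hypothetical upper half, exempt when configured
pool_default = available_pool_size(48, arch_regs, exempt_regs, False)
pool_first_value = available_pool_size(48, arch_regs, exempt_regs, True)
```

With these illustrative numbers, the pool doubles when the configuration data takes its first value, because half of the architectural registers no longer need a physical register.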