Abstract:
Processing circuitry performs processing operations specified by program instructions. An instruction decoder decodes an atomic-add-with-carry instruction AAD-DC to control the processing circuitry to perform, as a single atomic operation, an add of an addend operand value and a data value stored in a memory. The operation generates a result value, which is stored in the memory, and a carry value indicating whether or not the add generated a carry out.
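As an illustration only, the behaviour of such an instruction can be modelled in software. The sketch below is not the claimed hardware: the 32-bit width, the `Memory` class and its method names are all assumptions made for the example, and a lock stands in for the atomicity that the processing circuitry would provide.

```python
import threading

WIDTH = 32                      # assumed operand width in bits
MASK = (1 << WIDTH) - 1


class Memory:
    """Toy word-addressed memory; the lock stands in for hardware atomicity."""

    def __init__(self):
        self._words = {}
        self._lock = threading.Lock()

    def store(self, addr, value):
        with self._lock:
            self._words[addr] = value & MASK

    def load(self, addr):
        with self._lock:
            return self._words.get(addr, 0)

    def atomic_add_with_carry(self, addr, addend):
        """Add addend to the word at addr as one indivisible step.

        The truncated sum is written back to memory; the return value is
        the carry out (1 if the sum did not fit in WIDTH bits, else 0).
        """
        with self._lock:
            total = self._words.get(addr, 0) + (addend & MASK)
            self._words[addr] = total & MASK
            return total >> WIDTH
```

One plausible use of the returned carry is to propagate it into the next, more significant word of a multi-word value held in memory.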
Abstract:
A graphics processing apparatus and method of performing graphics processing are provided. The graphics processing apparatus comprises a sequence of processing stages capable of performing graphics processing to generate a frame of display data. The graphics processing is performed on a tile-by-tile basis. The graphics processing apparatus is capable of determining if a current tile subject to the graphics processing is empty. At least one processing stage of the sequence of processing stages is omitted for graphics processing of the current tile in dependence on whether the current tile is empty.
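The stage-skipping idea can be sketched as follows. This is a minimal model under stated assumptions: a tile is treated as empty when its primitive list is empty (the abstract does not define emptiness), and `render_frame`, the stage names and the `skippable` set are hypothetical names chosen for the example.

```python
def render_frame(tiles, stages, skippable):
    """Run the stage sequence tile by tile, omitting some stages for empty tiles.

    tiles:     dict mapping tile id -> list of primitives in that tile
    stages:    ordered list of (name, fn) pairs, where fn(tile_id, prims)
    skippable: set of stage names that are omitted when a tile is empty
    Returns the list of (tile_id, stage_name) pairs actually executed.
    """
    executed = []
    for tile_id, prims in tiles.items():
        empty = len(prims) == 0          # assumed definition of "empty"
        for name, fn in stages:
            if empty and name in skippable:
                continue                 # stage omitted for this tile
            fn(tile_id, prims)
            executed.append((tile_id, name))
    return executed
```

In this sketch an empty tile still passes through any stage that is not marked skippable, for example a final write-out stage.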
Abstract:
A data processing apparatus comprises processing circuitry arranged to process processing threads using resources accessible to the processing circuitry. A pipeline is provided for handling at least two pending threads awaiting processing by the processing circuitry. The pipeline includes at least one resource-requesting pipeline stage for requesting access to resources for the pending threads. A priority controller controls priority levels of the pending threads. The priority levels define a priority with which pending threads are granted access to resources. When a pending thread reaches a final pipeline stage, if the requested resources are not yet available then the priority level of that thread is selectively raised and the thread is returned to a first pipeline stage of the pipeline. If the requested resources are available then the thread is forwarded from the pipeline.
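A single trip through such a pipeline might be modelled as below. The sketch simplifies the abstract in two labelled ways: there is one hypothetical resource counted in units, and the priority is raised on every failed trip rather than selectively. `PendingThread` and `pipeline_pass` are names invented for the example.

```python
from dataclasses import dataclass


@dataclass
class PendingThread:
    name: str
    needed: int          # units of a single hypothetical resource
    priority: int = 0


def pipeline_pass(pending, free_units):
    """Model one trip of the pending threads through the pipeline.

    Resources are granted in descending priority order. A thread whose
    request is satisfied is forwarded; otherwise its priority is raised
    and it is returned to the first stage for another trip.
    Returns (forwarded, recirculated, remaining_free_units).
    """
    forwarded, recirculated = [], []
    # sorted() is stable, so equal-priority threads keep their arrival order.
    for thread in sorted(pending, key=lambda t: -t.priority):
        if thread.needed <= free_units:
            free_units -= thread.needed
            forwarded.append(thread)
        else:
            thread.priority += 1
            recirculated.append(thread)
    return forwarded, recirculated, free_units
```

Raising the priority of a recirculated thread means that, on its next trip, it is considered ahead of newer threads at the default priority, which limits how long a thread can be starved.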
Abstract:
A data processor, such as a graphics processor, is disclosed. The data processor includes a set of one or more counters, and a control circuit that maintains a cache-like pool of corresponding entries. In response to a request for a counter, the control circuit may allocate an entry of the cache-like pool to thereby allocate a counter of the set.
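A software analogue of the cache-like pool is sketched below. The tag-based lookup and the least-recently-used eviction policy are assumptions added to make the example complete; the abstract specifies neither. `CounterPool` and `request` are hypothetical names.

```python
from collections import OrderedDict


class CounterPool:
    """Fixed set of counters managed through a cache-like pool of entries."""

    def __init__(self, num_counters):
        self.counters = [0] * num_counters
        self._entries = OrderedDict()          # tag -> counter index, oldest first
        self._free = list(range(num_counters))

    def request(self, tag):
        """Return the index of the counter allocated to tag.

        A hit reuses the existing entry. A miss takes a free counter if
        one exists, otherwise evicts the least recently used entry
        (assumed policy). A newly allocated counter is reset to zero.
        """
        if tag in self._entries:
            self._entries.move_to_end(tag)     # mark as most recently used
            return self._entries[tag]
        if self._free:
            index = self._free.pop(0)
        else:
            _, index = self._entries.popitem(last=False)
        self.counters[index] = 0
        self._entries[tag] = index
        return index
```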
Abstract:
Techniques for performing clipping of graphics primitives (60) with respect to a clipping boundary (65) are described. The clipping step (10) may be performed separately for each tile of a graphics frame to be rendered, after a primitive list for the tile has been read from a primitive memory (38). Clipping may be performed only for larger primitives whose size exceeds a given threshold. Clipping of a primitive (60) to the clipping boundary (65) may be performed inexactly so that only a single clipped primitive is generated, which may extend beyond the clipping boundary. A clipped primitive generated by clipping may be used for a depth function calculation of a primitive setup operation and not for an edge determination.
Abstract:
A data processing apparatus and method are provided for processing a received workload in order to generate result data. A thread group generator generates from the received workload a plurality of thread groups to be executed to process the received workload. Each thread group consists of a plurality of threads, and at least one thread group has an inter-thread dependency existing between the plurality of threads. Each thread may be either an active thread whose output is required to form the result data, or a dummy thread required to resolve the inter-thread dependency for one of the active threads but whose output is not required to form the result data. The thread group generator identifies for each thread group any dummy thread within that thread group. A thread execution unit then executes each thread within a thread group received from the thread group generator by executing a predetermined program comprising a plurality of program instructions. Execution flow modification circuitry is responsive to the received thread group having at least one dummy thread, to cause the thread execution unit to selectively omit at least part of the execution of at least one of the plurality of instructions when executing each dummy thread, in dependence on control information associated with the predetermined program. In one particular embodiment the received workload is a graphics rendering workload and the thread execution unit performs graphics rendering operations in order to generate as the result data pixel values and associated control values. Such an approach can yield significant improvements in performance, as well as reducing power consumption.
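The selective omission for dummy threads can be illustrated with a small model. Here the control information associated with the program is assumed to be a per-instruction flag, and each instruction is a Python callable acting on a per-thread state dictionary; both choices, and the name `execute_group`, are assumptions for the sketch.

```python
def execute_group(program, threads):
    """Execute a program for every thread in a group, skipping work for dummies.

    program: list of (fn, skip_for_dummy) pairs; fn(state) updates the state
    threads: list of dicts with keys 'state' (dict) and 'dummy' (bool)
    Returns the number of instruction executions actually performed.
    """
    executed = 0
    for fn, skip_for_dummy in program:
        for thread in threads:
            if thread["dummy"] and skip_for_dummy:
                continue        # output not needed, so the instruction is omitted
            fn(thread["state"])
            executed += 1
    return executed
```

An instruction that feeds an inter-thread dependency would be left unflagged so that dummy threads still execute it, while an instruction that only contributes to a thread's own output would be flagged and skipped.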
Abstract:
A data processing system includes an external memory system, a processor and an internal memory system. The internal memory system includes an internal memory that stores data for use by the processor when performing data processing operations. The internal memory system also includes a data encoder associated with the internal memory. The data encoder reads data from the external memory system to the data encoder and returns the data to the external memory system from the data encoder, without storing the data in the internal memory.
Abstract:
An apparatus, method and program are provided for calculating a result value to a required precision of a repeating iterative sum, wherein the repeating iterative sum comprises multiple iterations of an addition using an input value. Addition is performed in a single iteration of addition as a sum operation using overlapping portions of the input value and a shifted version of the input value, wherein the shifted version of the input value has a partial overlap with the input value. At least one result portion is produced by incrementing an input derived from the input value using the output from the sum operation and the result value is constructed using the at least one result portion to give the result value to the required precision. The repeating iterative sum is thereby flattened into a flattened calculation which requires only a single iteration of addition using the input value, thus facilitating the calculation of the result value of the repeating iterative sum.
Abstract:
A data processing system includes a processing pipeline for the parallel execution of a plurality of threads. An issue controller issues threads to the processing pipeline. A stall manager controls the stalling and unstalling of threads when a cache miss occurs within a cache memory. The issue controller issues the threads to the processing pipeline in accordance with both a main sequence and a pilot sequence. The pilot sequence is followed such that threads within the pilot sequence are issued at least a given time ahead of their neighbours within a main sequence. The given time corresponds approximately to the latency associated with a cache miss. The threads may be arranged in groups corresponding to blocks of pixels for processing within a graphics processing unit.
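One possible reading of the combined main and pilot sequences is sketched below. It assumes that the first thread of each group acts as the pilot and expresses the lead as a number of groups rather than as a time; neither detail comes from the abstract, and `issue_order` is a hypothetical helper.

```python
def issue_order(groups, lead):
    """Interleave pilot threads ahead of the main sequence.

    groups: list of thread-id lists; groups[g][0] is treated as the pilot
    lead:   how many groups ahead of the main sequence the pilots run
    Returns the order in which thread ids are issued.
    """
    order = []
    count = len(groups)
    # Prime the pipeline with the first `lead` pilots.
    for g in range(min(lead, count)):
        order.append(groups[g][0])
    for g in range(count):
        ahead = g + lead
        if ahead < count:
            order.append(groups[ahead][0])   # pilot for a later group
        order.extend(groups[g][1:])          # remaining threads of this group
    return order
```

With a lead chosen to cover roughly one cache-miss latency, a miss taken by a pilot has a chance to be serviced before the rest of its group is issued.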
Abstract:
A method is disclosed of processing data in a graphics processor when performing tile-based rendering in which a render output is sub-divided into a plurality of tiles for rendering. The rendering is performed as two separate processing passes: a first processing pass that sorts primitives into respective regions of the render output and a second processing pass that renders the tiles into which the render output is sub-divided for rendering. During the first processing pass, “tile elimination” data is generated indicative of which of the rendering tiles should be rendered during the second processing pass. The tile elimination data generated in the first processing pass can then be used to control the rendering of tiles during the second processing pass.
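The two-pass flow can be sketched as follows. The example assumes that primitives are given as integer pixel bounding boxes and that a tile is eliminated when no primitive overlaps it; the abstract does not state the elimination criterion, and `first_pass` and `second_pass` are hypothetical names.

```python
def first_pass(primitives, tile_size, tiles_x, tiles_y):
    """Sort primitives into tiles and build per-tile elimination data.

    primitives: list of (xmin, ymin, xmax, ymax) integer bounding boxes
    Returns (bins, render_flags): bins maps (tx, ty) -> primitives, and
    render_flags maps (tx, ty) -> True if the tile should be rendered.
    """
    bins = {(tx, ty): [] for ty in range(tiles_y) for tx in range(tiles_x)}
    for prim in primitives:
        xmin, ymin, xmax, ymax = prim
        tx0 = max(xmin // tile_size, 0)
        tx1 = min(xmax // tile_size, tiles_x - 1)
        ty0 = max(ymin // tile_size, 0)
        ty1 = min(ymax // tile_size, tiles_y - 1)
        for ty in range(ty0, ty1 + 1):
            for tx in range(tx0, tx1 + 1):
                bins[(tx, ty)].append(prim)
    # Assumed criterion: render a tile only if something was binned into it.
    render_flags = {tile: bool(prims) for tile, prims in bins.items()}
    return bins, render_flags


def second_pass(bins, render_flags, render_tile):
    """Render only the tiles that the elimination data marks for rendering."""
    return [render_tile(tile, bins[tile]) for tile in bins if render_flags[tile]]
```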