摘要:
Systems, apparatuses, and methods for implementing a graphics processing unit (GPU) coprocessor are disclosed. The GPU coprocessor includes a SIMD unit with the ability to self-schedule sub-wave procedures based on input data flow events. A host processor sends messages targeting the GPU coprocessor to a queue. In response to detecting a first message in the queue, the GPU coprocessor schedules a first sub-task for execution. The GPU coprocessor includes an inter-lane crossbar and intra-lane biased indexing mechanism for a vector general purpose register (VGPR) file. The VGPR file is split into two files. The first VGPR file is a larger register file with one read port and one write port. The second VGPR file is a smaller register file with multiple read ports and one write port. The second VGPR introduces the ability to co-issue more than one instruction per clock cycle.
摘要:
A clock-less asynchronous processor comprising a plurality of parallel asynchronous processing logic circuits, each processing logic circuit configured to generate an instruction execution result. The processor comprises an asynchronous instruction dispatch unit coupled to each processing logic circuit, the instruction dispatch unit configured to receive multiple instructions from memory and dispatch individual instructions to each of the processing logic circuits. The processor comprises a crossbar coupled to an output of each processing logic circuit and to the dispatch unit, the crossbar configured to store the instruction execution results.
摘要:
A clock-less asynchronous processing circuit or system having a plurality of pipelined processing stages utilizes self-clocked generators to tune the delay needed in each of the processing stages to complete the processing cycle. Because different processing stages may require different amounts of time to complete processing or may require different delays depending on the processing required in a particular stage, the self-clocked generators may be tuned to each stage's necessary delay(s) or may be programmably configured.
摘要:
Disclosed herein is an arithmetic decoding device including: an arithmetic decoding unit configured to decode coded data resulting from arithmetic coding on a basis of a context variable indicating a probability state and a most probable symbol; a plurality of arithmetic registers configured to supply the context variable to the arithmetic decoding unit and retain a result of operation by the arithmetic decoding unit; and a plurality of save registers configured to save contents retained in the arithmetic registers.
摘要:
A graphics processing system for operation with a data store includes processing units for processing tasks. A check unit forms a signature which is characteristic of an output from processing a task on a processing unit, and a fault detection unit compares signatures formed at the check unit. The graphics processing system processes each task first and second times at the processing units so as to generate first and second processed outputs. The graphics processing system write outs the first processed output to the data store, reads back the first processed output from the data store and forms at the check unit a first signature characteristic of the first processed output as read back from the data store; forms at the check unit a second signature characteristic of the second processed output, compares the first and second signatures at the fault detection unit, and raises a fault signal if the signatures do not match.
摘要:
A graphics processing system for operation with a data store, comprising: one or more processing units for processing tasks; a check unit operable to form a signature which is characteristic of an output from processing a task on a processing unit; and a fault detection unit operable to compare signatures formed at the check unit; wherein the graphics processing system is operable to process each task first and second times at the one or more processing units so as to, respectively, generate first and second processed outputs, the graphics processing system being configured to: write out the first processed output to the data store; read back the first processed output from the data store and form at the check unit a first signature which is characteristic of the first processed output as read back from the data store; form at the check unit a second signature which is characteristic of the second processed output; compare the first and second signatures at the fault detection unit; and raise a fault signal if the first and second signatures do not match.
摘要:
An integrated coprocessor such as an accelerated processing unit (APU) generates commands for execution on a discrete coprocessor such as a discrete graphics processing unit (dGPU). Power distribution circuitry selectively provides power to the APU and the dGPU based on characteristics of workloads executing on the APU and the dGPU and based on a platform power limit that is shared by the APU and the dGPU. In some cases, the power distribution circuitry determines a first power provided to the APU and a second power provided to the dGPU. The power distribution circuitry increases the second power provided to the dGPU in response to a sum of the first and second powers being less than the platform power limit. In some cases, the power distribution circuitry modifies the power provided to the APU, the dGPU, or both in response to changes in temperatures measured by a set of sensors.
摘要:
A clock-less asynchronous processing circuit or system utilizes a self-clocked generator to adjust the processing delay (latency) needed/allowed to the processing cycle in the circuit/system. The timing of the self-clocked generator is dynamically adjustable depending on various parameters. These parameters may include processing instruction, opcode information, type of processing to be performed by the circuit/system, or overall desired processing performance. The latency may also be adjusted to change processing performance, including power consumption, speed etc.
摘要:
A clock-less asynchronous processing circuit or system is configured to operation in a plurality of modes. In an initialization mode (e.g., reset, initialization, boot up), a self-clocked generator associated with the asynchronous circuit is configured to generate an active complete signal (to latch output processed data) within a first period of time after receiving a trigger signal. In a normal mode, the self-clocked generator is configured to generate the active complete signal within a second period of time after receiving the trigger signal. In one embodiment, during the initialization mode, the asynchronous circuit latches the output slower than when in the normal mode.
摘要:
When a main processor issues a command to co-processor, a timeout value is included in the command. As the co-processor attempts to execute the command, it is determined whether the attempt is taking time beyond what is permitted by the timeout value. If the timeout is exceeded then responsive action is taken, such as the generation of a command timeout type failure message. The receipt of the command with the timeout value, and the consequent determination of a timeout condition for the command, may be determined by: the co-processor that receives the command, or a watchdog timer that is separate from the co-processor. Also, detection of co-processor hang and/or hung co-processor conditions during the time that a co-processor is executing a command for the main processor.