摘要:
Deadlocks are avoided by marking read requests issued by a parallel processor to system memory as “special.” Read completions associated with read requests marked as special are routed on virtual channel 1 of the PCIe bus. Data returning on virtual channel 1 cannot become stalled by write requests in virtual channel 0, thus avoiding a potential deadlock.
摘要:
Deadlocks are avoided by marking read requests issued by a parallel processor to system memory as “special.” Read completions associated with read requests marked as special are routed on virtual channel 1 of the PCIe bus. Data returning on virtual channel 1 cannot become stalled by write requests in virtual channel 0, thus avoiding a potential deadlock.
摘要:
The invention sets forth a crossbar unit that includes multiple virtual channels, each virtual channel being a logical flow of data within the crossbar unit. Arbitration logic coupled to source client subsystems is configured to select a virtual channel for transmitting a data request or a data packet to a destination client subsystem based on the type of the source client subsystem and/or the type of data request. Higher priority traffic is transmitted over virtual channels that are configured to transmit data without causing deadlocks and/or stalls. Lower priority traffic is transmitted over virtual channels that can be stalled.
摘要:
The invention sets forth a crossbar unit that includes multiple virtual channels, each virtual channel being a logical flow of data within the crossbar unit. Arbitration logic coupled to source client subsystems is configured to select a virtual channel for transmitting a data request or a data packet to a destination client subsystem based on the type of the source client subsystem and/or the type of data request. Higher priority traffic is transmitted over virtual channels that are configured to transmit data without causing deadlocks and/or stalls. Lower priority traffic is transmitted over virtual channels that can be stalled.
摘要:
One embodiment of the present invention sets forth a compression status bit cache with deterministic latency for isochronous memory clients of compressed memory. The compression status bit cache improves overall memory system performance by providing on-chip availability of compression status bits that are used to size and interpret a memory access request to compressed memory. To avoid non-deterministic latency when an isochronous memory client accesses the compression status bit cache, two design features are employed. The first design feature involves bypassing any intermediate cache when the compression status bit cache reads a new cache line in response to a cache read miss, thereby eliminating additional, potentially non-deterministic latencies outside the scope of the compression status bit cache. The second design feature involves maintaining a minimum pool of clean cache lines by opportunistically writing back dirty cache lines and, optionally, temporarily blocking non-critical requests that would dirty already clean cache lines. With clean cache lines available to be overwritten quickly, the compression status bit cache avoids incurring additional miss write back latencies.
摘要:
One embodiment of the invention sets forth a mechanism for efficiently processing atomic operations transmitted from multiple general processing clusters to an L2 cache. A tag look-up unit tracks the availability of each cache line in the L2 cache, reserves the necessary cache lines for the atomic operations and transmits the atomic operations to an ALU for processing. The tag look-up unit also increments a reference counter associated with a reserved cache line each time an atomic operation associated with that cache line is received. This feature allows multiple atomic operations associated with the same cache line to be pipelined to the ALU. A ROP unit that includes the ALU may request additional data necessary to process an atomic operation from the L2 cache. Result data is stored in the L2 cache and may also be returned to the general processing clusters.
摘要:
One embodiment of the present invention sets forth a method for implementing ECC protection in an on-chip L2 cache. When data is written to or read from an external memory, logic within the L2 cache is configured to generate ECC check bits and store the ECC check bits in the L2 cache in space typically allocated for storing byte enables. As a result, data stored in the L2 cache may be protected against bit errors without incurring the costs of providing additional storage or complex hardware for the ECC check bits.
摘要:
A method for cleaning dirty data in an intermediate cache is disclosed. A dirty data notification, including a memory address and a data class, is transmitted by a level 2 (L2) cache to frame buffer logic when dirty data is stored in the L2 cache. The data classes may include evict first, evict normal and evict last. In one embodiment, data belonging to the evict first data class is raster operations data with little reuse potential. The frame buffer logic uses a notification sorter to organize dirty data notifications, where an entry in the notification sorter stores the DRAM bank page number, a first count of cache lines that have resident dirty data and a second count of cache lines that have resident evict_first dirty data associated with that DRAM bank. The frame buffer logic transmits dirty data associated with an entry when the first count reaches a threshold.
摘要:
One embodiment of the present invention sets forth a compression status cache configured to store compression information for blocks of memory stored within an external memory. A data cache unit is configured to request, in response to a cache miss, compressed data from the external memory based on compression information stored in the compression status bit cache. The compression status for active buffers is dynamically swapped into the compression status cache as needed. Different compression formats may be specified for one or more tiles within an active buffer. One advantage of the disclosed compression status cache is that a lame amount of attached memory may be allocated as compressible memory blocks, without incurring a corresponding die area cost because a portion of the compression status stored off chip in attached memory is cached in the compression status cache.
摘要:
A system and method for buffering intermediate data in a processing pipeline architecture stores the intermediate data in a shared cache that is coupled between one or more pipeline processing units and an external memory. The shared cache provides storage that is used by multiple pipeline processing units. The storage capacity of the shared cache is dynamically allocated to the different pipeline processing units as needed, to avoid stalling the upstream units, thereby improving overall system throughput.