Abstract:
A clock recovery circuit includes a sampler for sampling a data signal. Logic determines whether a data edge lags or precedes an edge of the clock that drives the sampler, and provides corresponding early and late indications. A filter filters the early and late indications, and a phase controller adjusts the phase of the clock based on the filtered indications. Also based on the filtered indications, a frequency estimator estimates the frequency difference between the data signal and the clock, providing an input to the phase controller that further adjusts the phase so as to continually correct for that frequency difference.
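To make the loop concrete, the following is a minimal behavioral sketch in C++ of a second-order bang-bang recovery loop of this general shape: early/late decisions are majority-filtered, and the filtered result both bumps the clock phase directly and feeds a frequency estimate that is applied on every update. The filter depth and loop gains (kVoteThreshold, kPhaseGain, kFreqGain) are illustrative assumptions, not values from the circuit.

```cpp
#include <cstdint>

// Behavioral model of a second-order bang-bang clock recovery loop.
struct ClockRecovery {
    static constexpr int kVoteThreshold = 4;  // assumed filter depth
    static constexpr int kPhaseGain = 8;      // assumed proportional gain
    static constexpr int kFreqGain = 1;       // assumed integral gain

    int32_t phase = 0;        // clock phase accumulator (arbitrary units)
    int32_t freq_offset = 0;  // estimated data/clock frequency difference
    int vote = 0;             // majority filter over early/late indications

    // Called per data edge: +1 if the clock edge lags the data edge
    // (late indication), -1 if it precedes it (early indication).
    void update(int early_late) {
        vote += early_late;
        if (vote >= kVoteThreshold || vote <= -kVoteThreshold) {
            int dir = (vote > 0) ? 1 : -1;
            vote = 0;
            phase += dir * kPhaseGain;       // proportional phase correction
            freq_offset += dir * kFreqGain;  // frequency estimator path
        }
        phase += freq_offset;  // continual correction for the frequency gap
    }
};
```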
Abstract:
One embodiment of the present disclosure sets forth an effective way to maintain fairness and order in the scheduling of common resource access requests related to replay operations. Specifically, a streaming multiprocessor (SM) includes a total order queue (TOQ) configured to schedule the access requests over one or more execution cycles. An access request is allowed to make forward progress once the common resources it needs have been allocated to it. Where multiple access requests require the same common resource, priority is given to the older access request. Access requests may be placed in a sleep state pending availability of certain common resources. Deadlock may be avoided by allowing an older access request to steal resources from a younger access request. One advantage of the disclosed technique is that older common resource access requests are not repeatedly blocked from making forward progress by newer access requests.
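As a rough software analogy (not the SM hardware), the sketch below models age-ordered scheduling with sleeping and resource stealing: each cycle, the oldest request whose resource is free, or is held by a strictly younger request, makes forward progress. The names, the single-resource-per-request simplification, and the stealing rule's details are all assumptions.

```cpp
#include <cstdint>
#include <optional>
#include <vector>

struct Request {
    uint64_t age;         // lower value = older (issue order)
    int resource;         // id of the one common resource it needs
    bool asleep = false;  // parked until that resource becomes available
};

struct Toq {
    std::vector<Request> pending;                // kept in age order
    std::vector<std::optional<uint64_t>> owner;  // resource -> holder's age

    explicit Toq(int num_resources) : owner(num_resources) {}

    void enqueue(const Request& r) { pending.push_back(r); }

    // One scheduling cycle: grant the oldest request whose resource is
    // free, or let it steal the resource from a strictly younger holder.
    std::optional<Request> schedule() {
        for (auto it = pending.begin(); it != pending.end(); ++it) {
            auto& holder = owner[it->resource];
            bool free = !holder.has_value();
            bool can_steal = holder.has_value() && *holder > it->age;
            if (free || can_steal) {
                holder = it->age;       // allocate (or steal) the resource
                Request granted = *it;
                pending.erase(it);
                return granted;         // request makes forward progress
            }
            it->asleep = true;          // sleep pending availability
        }
        return std::nullopt;            // nothing eligible this cycle
    }
};
```

Because the scan always starts from the oldest pending request and stealing is only permitted from younger holders, a newer request can never indefinitely block an older one, which mirrors the fairness property claimed above.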
Abstract:
One embodiment of the invention sets forth a mechanism for efficiently writing dirty data from the L2 cache to a DRAM. A dirty data notification, including the memory address of the dirty data, is transmitted by the L2 cache to frame buffer logic when dirty data is stored in the L2 cache. The frame buffer logic uses a page-stream sorter to organize dirty data notifications based on the bank page associated with the memory addresses included in the notifications. The page-stream sorter includes multiple sets with entries that may be associated with different bank pages in the DRAM. The frame buffer logic transmits to the DRAM the dirty data associated with any entry whose number of dirty data notifications reaches a maximum threshold. The frame buffer logic also transmits the dirty data associated with the oldest entry when the number of entries in a set reaches a maximum threshold.
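The sketch below illustrates one plausible reading of the page-stream sorter's two flush triggers, collapsing the multiple sets into a single set for brevity; kNotifyThreshold, kSetCapacity, and the bank-page derivation from the address are assumed, not taken from the design.

```cpp
#include <cstdint>
#include <deque>
#include <unordered_map>
#include <vector>

struct PageStreamSorter {
    static constexpr int kNotifyThreshold = 8;   // assumed per-page limit
    static constexpr std::size_t kSetCapacity = 4;  // assumed set size

    std::unordered_map<uint64_t, int> notify_count;  // bank page -> count
    std::deque<uint64_t> age_order;                  // oldest page first
    std::vector<uint64_t> flush_queue;               // pages sent to DRAM

    void on_dirty_notification(uint64_t addr) {
        uint64_t bank_page = addr >> 12;  // assumed page-sized grouping
        if (!notify_count.count(bank_page)) {
            if (age_order.size() == kSetCapacity)  // set is full:
                flush(age_order.front());          // write out oldest entry
            age_order.push_back(bank_page);
        }
        if (++notify_count[bank_page] >= kNotifyThreshold)
            flush(bank_page);  // enough dirty lines for an efficient burst
    }

    void flush(uint64_t bank_page) {
        flush_queue.push_back(bank_page);  // stand-in for the DRAM write
        notify_count.erase(bank_page);
        for (auto it = age_order.begin(); it != age_order.end(); ++it)
            if (*it == bank_page) { age_order.erase(it); break; }
    }
};
```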
Abstract:
One embodiment of the invention is a method for evicting data from an intermediary cache that includes the steps of receiving a command from a client, determining that there is a cache miss relative to the intermediary cache, identifying one or more cache lines within the intermediary cache to store data associated with the command, determining whether any of the data residing in the one or more cache lines includes raster operations data or normal data, and causing the data residing in the one or more cache lines to be evicted, or stalling the command, based at least in part on whether the data includes raster operations data or normal data. Advantageously, the method allows a series of cache eviction policies to be applied based on how cached data is categorized and on the eviction classes of the data. Consequently, more optimized eviction decisions may be made, leading to fewer command stalls and improved performance.
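One hypothetical policy consistent with this description is sketched below: the miss handler inspects the candidate victim lines and either evicts or stalls depending on whether they hold raster operations (ROP) data. The specific rule shown (stall only on dirty ROP data) is an illustrative assumption, not the patented policy.

```cpp
#include <vector>

enum class DataClass { Empty, Normal, RasterOps };

struct CacheLine {
    DataClass cls = DataClass::Empty;
    bool dirty = false;
};

enum class Action { Evict, Stall };

// Assumed policy: normal or clean data may be evicted immediately; dirty
// ROP data stalls the command until the line can be written back.
Action on_miss(const std::vector<CacheLine>& candidates) {
    for (const CacheLine& line : candidates) {
        if (line.cls == DataClass::RasterOps && line.dirty)
            return Action::Stall;   // wait rather than lose ROP data
    }
    return Action::Evict;           // safe to reuse one of the lines
}
```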
Abstract:
A system and method for cleaning dirty data in an intermediate cache are disclosed. A dirty data notification, including a memory address and a data class, is transmitted by a level 2 (L2) cache to frame buffer logic when dirty data is stored in the L2 cache. The data classes include evict_first, evict_normal, and evict_last. In one embodiment, data belonging to the evict_first data class is raster operations data with little reuse potential. The frame buffer logic uses a notification sorter to organize dirty data notifications, where an entry in the notification sorter stores the DRAM bank page number, a first count of cache lines that have resident dirty data, and a second count of cache lines that have resident evict_first dirty data associated with that DRAM bank page. The frame buffer logic transmits the dirty data associated with an entry when the first count reaches a threshold.
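A compact sketch of such a notification-sorter entry might look like the following, with the two per-entry counts described above; the threshold value and the choice to key entries by bank page number in a hash map are assumptions made for illustration.

```cpp
#include <cstdint>
#include <unordered_map>

enum class DataClass { EvictFirst, EvictNormal, EvictLast };

struct SorterEntry {
    int dirty_lines = 0;        // first count: all resident dirty lines
    int evict_first_lines = 0;  // second count: evict_first dirty lines
};

struct NotificationSorter {
    static constexpr int kCleanThreshold = 8;  // assumed threshold
    std::unordered_map<uint64_t, SorterEntry> entries;  // keyed by bank page

    // Returns true when the bank page should be cleaned to DRAM.
    bool on_dirty_notification(uint64_t bank_page, DataClass cls) {
        SorterEntry& e = entries[bank_page];
        ++e.dirty_lines;
        if (cls == DataClass::EvictFirst) ++e.evict_first_lines;
        if (e.dirty_lines >= kCleanThreshold) {
            entries.erase(bank_page);  // dirty data transmitted to DRAM
            return true;
        }
        return false;
    }
};
```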
Abstract:
Embodiments of the present invention enable virtual-to-physical memory address translation using optimized bank and partition interleave patterns to improve memory bandwidth by distributing data accesses over multiple banks and multiple partitions. Each virtual page has a corresponding page table entry that specifies the physical address of the virtual page in linear physical address space. The page table entry also includes a data kind field that is used to guide and optimize the mapping process from the linear physical address space to the DRAM physical address space, which is used to directly access one or more DRAMs. The DRAM physical address space includes a row, bank, and column address. The data kind field is also used to optimize the starting partition number and the partition interleave pattern that defines the organization of the selected physical page of memory within the DRAM memory system.
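Purely to illustrate the shape of such a mapping, the toy function below derives a DRAM row/bank/column address and a partition number from a linear physical address, letting an assumed kind field swizzle the bank bits and select the partition stride and starting partition. Every bit position and constant here is invented for the example.

```cpp
#include <cstdint>

struct DramAddress { uint32_t row, bank, column, partition; };

// kind: assumed small field taken from the page table entry.
DramAddress map_linear(uint64_t linear, uint32_t kind) {
    DramAddress a;
    a.column = static_cast<uint32_t>(linear & 0xFF);        // low bits -> column
    uint32_t bank = static_cast<uint32_t>((linear >> 8) & 0x7);
    a.bank = bank ^ (kind & 0x7);                           // kind-guided swizzle
    a.row = static_cast<uint32_t>(linear >> 11);            // remaining bits
    // Interleave granularity and starting partition chosen by kind:
    uint32_t stride_log2 = (kind & 0x8) ? 8u : 10u;         // 256 B vs 1 KB
    a.partition = static_cast<uint32_t>(
        ((linear >> stride_log2) + (kind >> 4)) % 4);       // assume 4 partitions
    return a;
}
```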
Abstract:
Systems and methods for addressing memory using non-power-of-two virtual memory page sizes improve graphics memory bandwidth by distributing graphics data for efficient access during rendering. Various partition strides may be selected for each virtual memory page to modify the number of sequential addresses mapped to each physical memory partition and change the interleaving granularity. The addressing scheme allows for modification of a bank interleave pattern for each virtual memory page to reduce bank conflicts and improve memory bandwidth utilization. The addressing scheme also allows for modification of a partition interleave pattern for each virtual memory page to distribute accesses amongst multiple partitions and improve memory bandwidth utilization.
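A minimal sketch of per-page partition interleaving under these assumptions: each virtual page carries its own partition stride and interleave offset, so the granularity and rotation of the partition pattern can differ from page to page. The field names and the modulo-rotation scheme are illustrative, not the patented address computation.

```cpp
#include <cstdint>

struct PageMapping {
    uint32_t partition_stride;   // bytes mapped to one partition in a row
    uint32_t interleave_offset;  // rotates the partition pattern per page
};

// Which partition a byte offset within the page falls on. A coarser
// stride keeps a burst within one partition; a finer stride spreads
// consecutive accesses across partitions.
uint32_t partition_of(uint64_t page_offset, const PageMapping& m,
                      uint32_t num_partitions) {
    uint64_t chunk = page_offset / m.partition_stride;
    return static_cast<uint32_t>((chunk + m.interleave_offset) % num_partitions);
}
```

Selecting a different partition_stride per virtual page is what changes the interleaving granularity, and varying interleave_offset per page distributes accesses amongst the partitions as described above.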
Abstract:
A method of scheduling program instructions for execution in a computer processor comprises fetching and holding instructions from an instruction memory and executing the fetched instructions out of program order. When a load/store order violation is detected, the effects of the load operation and its dependent instructions are erased, and they are re-executed. The load is associated with all stores on whose data the load depends; this collection of stores is called a store set. On a subsequent issuance of the load, its execution is delayed until any store in the load's store set has issued. Two loads may share a store set, and separate store sets are merged when a load from one store set is found to depend on a store from another store set. A preferred embodiment employs two tables. The first is a store set ID table (SSIT), which is indexed by part of, or a hash of, an instruction PC. Entries in the SSIT provide a store set ID that is used to index into the second table, which, for each store set, contains a pointer to the last fetched, unexecuted store instruction.
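The two-table structure lends itself to a compact sketch. The version below assumes table sizes and a trivial PC hash, and stands in for the pointer to the last fetched, unexecuted store with a tag that the store clears when it executes.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <optional>

constexpr std::size_t kSsitSize = 4096;  // assumed table sizes
constexpr std::size_t kNumStoreSets = 128;

// SSIT: hashed instruction PC -> store set ID (empty = no store set).
std::array<std::optional<uint16_t>, kSsitSize> ssit;
// Per store set: tag of the last fetched, unexecuted store (empty = none).
std::array<std::optional<uint64_t>, kNumStoreSets> lfst;

std::size_t ssit_index(uint64_t pc) { return (pc >> 2) % kSsitSize; }

// A load consults its store set: if the set's last fetched store has not
// executed yet, the load must wait on it (its tag is returned).
std::optional<uint64_t> load_must_wait_on(uint64_t load_pc) {
    auto set_id = ssit[ssit_index(load_pc)];
    if (!set_id) return std::nullopt;  // no recorded dependence
    return lfst[*set_id];              // empty if that store already executed
}

// A fetched store belonging to a set becomes the set's "last fetched" store.
void on_store_fetch(uint64_t store_pc, uint64_t store_tag) {
    auto set_id = ssit[ssit_index(store_pc)];
    if (set_id) lfst[*set_id] = store_tag;
}

// When a store executes, it removes itself from the second table.
void on_store_execute(uint64_t store_pc) {
    auto set_id = ssit[ssit_index(store_pc)];
    if (set_id) lfst[*set_id].reset();
}
```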