Abstract:
In an embodiment, a system may support programmable hashing of address bits at a plurality of levels of granularity to map memory addresses to memory controllers and ultimately at least to memory devices. The hashing may be programmed to distribute pages of memory across the memory controllers, and consecutive blocks of the page may be mapped to physically distant memory controllers. In an embodiment, address bits may be dropped from each level of granularity, forming a compacted pipe address to save power within the memory controller. In an embodiment, a memory folding scheme may be employed to reduce the number of active memory devices and/or memory controllers in the system when the full complement of memory is not needed.
Abstract:
Techniques are disclosed relating to an I/O agent circuit. The I/O agent circuit may include a transaction pipeline and a pool of counters. The I/O agent circuit may initialize a first counter included in the pool of counters with an initial counter value. The I/O agent circuit may assign the first counter to a specific transaction type. The I/O agent circuit may increment the first counter as a part of allocating a transaction of a transaction type included in a set of transaction types different than the specific transaction type. Based on receiving a transaction request to process a first transaction of the specific transaction type, the I/O agent circuit may bind the first transaction to the first counter. The I/O agent circuit may issue the first transaction to the transaction pipeline based on a counter value stored by the first counter matching the initial counter value.
Abstract:
An integrated circuit (IC) including a plurality of processor cores, a plurality of graphics processing units, a plurality of peripheral circuits, and a plurality of memory controllers is configured to support scaling of the system using a unified memory architecture. For example, the IC may include an interconnect fabric configured to provide communication between the one or more memory controller circuits and the processor cores, graphics processing units, and peripheral devices; and an off-chip interconnect coupled to the interconnect fabric and configured to couple the interconnect fabric to a corresponding interconnect fabric on another instance of the integrated circuit, wherein the interconnect fabric and the off-chip interconnect provide an interface that transparently connects the one or more memory controller circuits, the processor cores, graphics processing units, and peripheral devices in either a single instance of the integrated circuit or two or more instances of the integrated circuit.
Abstract:
Systems, apparatuses, and methods for efficiently moving data for storage and processing a compression unit within a processor includes multiple hardware lanes, selects two or more input words to compress, and for assigns them to two or more of the multiple hardware lanes. As each assigned input word is processed, each word is compared to an entry of a plurality of entries of a table. If it is determined that each of the assigned input words indexes the same entry of the table, the hardware lane with the oldest input word generates a single read request for the table entry and the hardware lane with the youngest input word generates a single write request for updating the table entry upon completing compression. Each hardware lane generates a compressed packet based on its assigned input word.
Abstract:
In an embodiment, an integrated circuit may include one or more processors. Each processor may include multiple processor cores, and each core has a different design/implementation and performance level. For example, a core may be implemented for high performance, and another core may be implemented at a lower maximum performance, but may be optimized for efficiency. Additionally, in some embodiments, some features of the instruction set architecture implemented by the processor may be implemented in only one of the cores that make up the processor. If such a feature is invoked by a code sequence while a different core is active, the processor may swap cores to the core the implements the feature. Alternatively, an exception may be taken and an exception handler may be executed to identify the feature and activate the corresponding core.
Abstract:
In an embodiment, a system includes multiple power management mechanism operating in different time domains (e.g. with different bandwidths) and control circuitry that is configured to coordinate operation of the mechanisms. If one mechanism is adding energy to the system, for example, the control circuitry may inform another mechanism that the energy is coming so that the other mechanism may not take as drastic an action as it would if no energy were coming. If a light workload is detected by circuitry near the load, and there is plenty of energy in the system, the control circuitry may cause the power management unit (PMU) to generate less energy or even temporarily turn off. A variety of mechanisms for the coordinated, coherent use of power are described.
Abstract:
In an embodiment, an integrated circuit may include one or more processors. Each processor may include multiple processor cores, and each core has a different design/implementation and performance level. For example, a core may be implemented for high performance, and another core may be implemented at a lower maximum performance, but may be optimized for efficiency. Additionally, in some embodiments, some features of the instruction set architecture implemented by the processor may be implemented in only one of the cores that make up the processor. If such a feature is invoked by a code sequence while a different core is active, the processor may swap cores to the core the implements the feature. Alternatively, an exception may be taken and an exception handler may be executed to identify the feature and activate the corresponding core.
Abstract:
Various techniques for processing instructions that specify multiple destinations. A first portion of a processor pipeline is configured to split a multi-destination instruction into a plurality of single-destination operations. A second portion of the pipeline is configured to process the plurality of single-destination operations. A third portion of the pipeline is configured to merge the plurality of single-destination operations into one or more multi-destination operations. The one or more multi-destination operations may be performed. The first portion of the pipeline may include a decode unit. The second portion of the pipeline may include a map unit, which may in turn include circuitry configured to maintain a list of free architectural registers and a mapping table that maps physical registers to architectural registers. The third portion of the pipeline may comprise a dispatch unit. In some embodiments, this may provide certain advantages such as reduced area and/or power consumption.
Abstract:
Systems, processors, and methods for keeping uncacheable data coherent. A processor includes a multi-level cache hierarchy, and uncacheable load memory operations can be cached at any level of the cache hierarchy. If an uncacheable load misses in the L2 cache, then allocation of the uncacheable load will be restricted to a subset of the ways of the L2 cache. If an uncacheable store memory operation hits in the L1 cache, then the hit cache line can be updated with the data from the memory operation. If the uncacheable store misses in the L1 cache, then the uncacheable store is sent to a core interface unit. Multiple contiguous store misses are merged into larger blocks of data in the core interface unit before being sent to the L2 cache.
Abstract:
In an embodiment, a processor may implement an access map-pattern match (AMPM)-based prefetcher in which patterns may include wild cards for some cache blocks. The wild card may match any access for the corresponding cache block (e.g. no access, demand access, prefetch, successful prefetch, etc.). Furthermore, patterns with irregular strides and/or irregular access patterns may be included in the matching patterns and may be detected for prefetch generation. In an embodiment, the AMPM prefetcher may implement a chained access map for large streaming prefetches. If a stream is detected, the AMPM prefetcher may allocate a pair of map entries for the stream and may reuse the pair for subsequent access map regions within the stream. In some embodiments, a quality factor may be associated with each access map and may control the rate of prefetch generation.