Abstract:
A method and system for compressing and decompressing data are disclosed. A compression command may initiate the prefetching of first data, which may be stored in a first buffer. Multiple words of the first data may be read from the first buffer and used to generate a plurality of compressed packets, each of which includes a command specifying a type of packet. The compressed packets may be combined into a group, and multiple groups may be combined and stored in a second buffer. A decompression command may initiate the prefetching of second data, which may be stored in the first buffer. A portion of the second data may be read from the first buffer and used to generate a group of compressed packets. Multiple output words may be generated dependent upon the group of compressed packets.
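As an illustration of the packet structure this abstract describes, here is a minimal C sketch; the packet types, field widths, dictionary comparison, and GROUP_SIZE are assumptions for illustration, not details taken from the disclosure.

```c
#include <stdint.h>

/* Hypothetical packet layout: each compressed packet carries a command
 * specifying its type, as the abstract describes. */
typedef enum { PKT_ZERO, PKT_MATCH, PKT_PARTIAL_MATCH, PKT_MISS } pkt_cmd_t;

typedef struct {
    pkt_cmd_t cmd;     /* command specifying the type of packet */
    uint32_t  payload; /* literal word or partial-match bytes */
} packet_t;

#define GROUP_SIZE 8   /* assumed number of packets combined per group */

typedef struct {
    packet_t pkts[GROUP_SIZE];
} group_t;

/* Classify one input word against a dictionary entry and emit a packet. */
packet_t compress_word(uint32_t word, uint32_t dict_entry) {
    packet_t p;
    if (word == 0)               { p.cmd = PKT_ZERO;          p.payload = 0; }
    else if (word == dict_entry) { p.cmd = PKT_MATCH;         p.payload = 0; }
    else if ((word >> 8) == (dict_entry >> 8))
                                 { p.cmd = PKT_PARTIAL_MATCH; p.payload = word & 0xFFu; }
    else                         { p.cmd = PKT_MISS;          p.payload = word; }
    return p;
}
```

Decompression would walk a group in the reverse direction: read each packet's command and reproduce the corresponding output word.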
Abstract:
Techniques are disclosed relating to completion of load and store instructions in a weakly-ordered memory model. In one embodiment, a processor includes a load queue and a store queue and is configured to associate queue information with a load instruction in an instruction stream. In this embodiment, the queue information indicates a location of the load instruction in the load queue and one or more locations in the store queue that are associated with one or more store instructions that are older than the load instruction. The processor may determine, using the queue information, that the load instruction does not conflict with a store instruction in the store queue that is older than the load instruction. The processor may remove the load instruction from the load queue while the store instruction remains in the store queue. The queue information may include a wrap value for the load queue.
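A minimal C sketch of one plausible encoding of this queue information; the field names, queue depth, and the equality-only conflict check are assumptions, and the wrap-around handling that the wrap values would resolve in hardware is omitted for brevity.

```c
#include <stdbool.h>
#include <stdint.h>

#define SQ_SIZE 16  /* assumed store-queue depth */

/* Hypothetical queue information associated with a load: its own
 * load-queue slot plus the store-queue region holding older stores. */
typedef struct {
    uint8_t lq_index;  /* location of the load in the load queue */
    uint8_t lq_wrap;   /* wrap value for the load queue */
    uint8_t sq_tail;   /* youngest store-queue entry older than the load */
    uint8_t sq_wrap;   /* wrap value for the store-queue pointer */
} load_queue_info_t;

typedef struct {
    bool     valid;
    uint64_t addr;
} sq_entry_t;

/* Returns true if no older store overlaps the load address, in which
 * case the load may leave the load queue while the stores remain.
 * (Simplified: scans [0, sq_tail] and ignores pointer wrap-around.) */
bool load_may_complete(const load_queue_info_t *info,
                       const sq_entry_t sq[SQ_SIZE],
                       uint64_t load_addr) {
    for (uint8_t i = 0; i <= info->sq_tail; i++) {
        if (sq[i].valid && sq[i].addr == load_addr)
            return false;  /* conflict with an older store */
    }
    return true;
}
```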
Abstract:
In some embodiments, a system may include a sub-hierarchy clock control. In some embodiments, the system may include a master unit. The master unit may include an interface unit electrically coupled to a slave unit. The interface unit may monitor, during use, usage requests of the slave unit by the master unit. In some embodiments, the interface unit may turn off clocks to the slave unit during periods of nonuse. In some embodiments, the interface unit may determine whether a predetermined period of time elapsed before clocks to the slave unit were turned back on, such that turning off the slave unit resulted in greater system efficiency. In some embodiments, the interface unit may maintain, during use, power to the slave unit during periods of nonuse, such that data stored in the slave unit is preserved.
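A minimal C model of the interface unit's gating decision; the idle threshold and field names are illustrative assumptions. Note that power stays on while the clock is gated, so the slave's state survives.

```c
#include <stdbool.h>
#include <stdint.h>

#define IDLE_THRESHOLD 64  /* assumed cycles of nonuse before gating */

typedef struct {
    uint32_t idle_cycles;
    bool     clock_enabled;
    bool     power_enabled;  /* stays true even while the clock is gated */
} slave_if_t;

/* Called once per cycle with whether the master issued a usage request. */
void slave_if_tick(slave_if_t *s, bool request_from_master) {
    s->power_enabled = true;               /* preserve data in the slave */
    if (request_from_master) {
        s->idle_cycles   = 0;
        s->clock_enabled = true;           /* turn clocks back on */
    } else if (++s->idle_cycles >= IDLE_THRESHOLD) {
        s->clock_enabled = false;          /* gate clocks during nonuse */
    }
}
```

Whether gating paid off can then be judged by checking that the slave stayed gated for at least the predetermined period before the next request arrived.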
Abstract:
Processors and methods for preventing lower level prefetch units from stalling at page boundaries. An upper level prefetch unit closest to the processor core issues a preemptive request for a translation of the next page in a given prefetch stream. The upper level prefetch unit sends the translation to the lower level prefetch units prior to the lower level prefetch units reaching the end of the current page for the given prefetch stream. When the lower level prefetch units reach the boundary of the current page, instead of stopping, these prefetch units can continue to prefetch by jumping to the next physical page number provided in the translation.
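A minimal C sketch of how a lower-level prefetch unit could continue across a page boundary using a pre-supplied translation; the stream state and 4 KiB page size are assumptions.

```c
#include <stdint.h>

#define PAGE_SIZE 4096u  /* assumed page size */

/* Simplified per-stream state in a lower-level prefetch unit.
 * next_phys_page holds the translation the upper-level prefetch unit
 * requested preemptively and pushed down before the boundary. */
typedef struct {
    uint64_t phys_addr;       /* current prefetch address */
    uint64_t next_phys_page;  /* physical address of the next page */
    uint32_t stride;          /* bytes advanced per prefetch */
} prefetch_stream_t;

/* Advance the stream; instead of stalling at the page boundary, jump
 * into the next physical page provided by the translation. */
void prefetch_advance(prefetch_stream_t *s) {
    uint64_t next = s->phys_addr + s->stride;
    if ((next / PAGE_SIZE) != (s->phys_addr / PAGE_SIZE)) {
        next = s->next_phys_page + (next % PAGE_SIZE);
    }
    s->phys_addr = next;
}
```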
Abstract:
Systems, processors, and methods for keeping uncacheable data coherent. A processor includes a multi-level cache hierarchy, and uncacheable load memory operations can be cached at any level of the cache hierarchy. If an uncacheable load misses in the L2 cache, then allocation of the uncacheable load will be restricted to a subset of the ways of the L2 cache. If an uncacheable store memory operation hits in the L1 cache, then the hit cache line can be updated with the data from the memory operation. If the uncacheable store misses in the L1 cache, then the uncacheable store is sent to a core interface unit. Multiple contiguous store misses are merged into larger blocks of data in the core interface unit before being sent to the L2 cache.
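Two fragments sketching these mechanisms in C; the way counts, 64-byte block size, and names are assumptions for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

#define L2_WAYS          16
#define UNCACHEABLE_WAYS  2  /* assumed subset of ways for uncacheable fills */

/* On an uncacheable load that misses in the L2, restrict the allocated
 * way to a small subset so uncacheable data cannot occupy the full set. */
unsigned pick_victim_way(bool uncacheable, unsigned lru_way) {
    return uncacheable ? lru_way % UNCACHEABLE_WAYS : lru_way % L2_WAYS;
}

/* Core-interface-unit merge buffer: contiguous uncacheable store misses
 * within one 64-byte block are merged before being sent to the L2. */
typedef struct {
    uint64_t base;   /* 64-byte-aligned block address */
    uint64_t valid;  /* one valid bit per byte of the block */
} merge_buffer_t;

bool try_merge(merge_buffer_t *m, uint64_t addr, unsigned size) {
    if ((addr & ~63ull) != m->base)
        return false;                       /* different block, no merge */
    unsigned off = (unsigned)(addr & 63u);  /* assumes off + size <= 64 */
    m->valid |= (size >= 64 ? ~0ull : ((1ull << size) - 1)) << off;
    return true;
}
```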
Abstract:
In an embodiment, a coprocessor may include a plurality of processing element circuits arranged in a first grid, where a given coprocessor instruction of an instruction set for the coprocessor is defined to cause evaluation of a second plurality of processing element circuits arranged in a second grid, where the second grid includes more processing element circuits than the first grid. The coprocessor may further include a scheduler circuit configured to issue instruction operations to the plurality of processing element circuits, where the scheduler circuit is configured to issue a given instruction operation corresponding to the given coprocessor instruction a plurality of times to complete the given coprocessor instruction, wherein different issuances of the given instruction operation are configured to cause respective different portions of the evaluation defined by the given coprocessor instruction to be performed.
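A minimal C sketch of the reissue idea; the logical and physical grid sizes and names are assumptions, not figures from the disclosure.

```c
#include <stdio.h>

#define LOGICAL_ROWS 8  /* grid size the instruction is defined against */
#define PHYS_ROWS    4  /* rows the implementation actually provides */

/* The scheduler issues the same instruction operation several times;
 * each issuance evaluates a different slice of the larger logical grid
 * on the smaller physical grid. */
void issue_op(int op_id) {
    for (int pass = 0; pass < LOGICAL_ROWS / PHYS_ROWS; pass++) {
        int row_base = pass * PHYS_ROWS;
        printf("op %d: pass %d evaluates logical rows %d..%d\n",
               op_id, pass, row_base, row_base + PHYS_ROWS - 1);
    }
}
```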
Abstract:
Techniques are disclosed relating to data synchronization barrier operations. A system includes a first processor that may receive a data barrier operation request from a second processor included in the system. Based on receiving that data barrier operation request from the second processor, the first processor may ensure that outstanding load/store operations executed by the first processor that are directed to addresses outside of an exclusion region have been completed. The first processor may respond to the second processor that the data barrier operation request is complete at the first processor, even in the case that one or more load/store operations directed to addresses within the exclusion region are outstanding and not complete when the first processor responds.
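A minimal C sketch of the completion check; the exclusion-region bounds and structures are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint64_t addr;
    bool     outstanding;
} memop_t;

/* Assumed exclusion-region bounds; a real design would take these from
 * configuration state rather than hard-coded constants. */
bool in_exclusion_region(uint64_t addr) {
    return addr >= 0x8000000000ull && addr < 0x8000100000ull;
}

/* The first processor may acknowledge the barrier once every outstanding
 * load/store OUTSIDE the exclusion region has completed, even if
 * operations inside the region are still in flight. */
bool barrier_complete(const memop_t *ops, int n) {
    for (int i = 0; i < n; i++) {
        if (ops[i].outstanding && !in_exclusion_region(ops[i].addr))
            return false;  /* must drain before responding */
    }
    return true;
}
```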
Abstract:
A system and method for efficiently transferring address mappings and data access permissions corresponding to the address mappings. A computing system includes at least one processor, an address translation unit, and memory for storing a page table. In response to receiving a memory access operation comprising a first address, the address translation unit is configured to identify a data access permission based on a permission index corresponding to the first address, and to access data stored in a memory location of the memory identified by a second address in a manner defined by the identified data access permission. The address translation unit is configured to access a table to identify the data access permission, and is configured to determine the permission index and the second address based on the first address. A single permission index may correspond to different permissions for different entities within the system.
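A minimal C sketch of the indirection the abstract describes; the table sizes and the notion of entity are assumptions.

```c
#include <stdint.h>

#define NUM_PERM_INDICES 16
#define NUM_ENTITIES      4  /* e.g. distinct agents sharing one mapping */

typedef enum { PERM_NONE, PERM_RO, PERM_RW, PERM_RX } perm_t;

/* A translation yields the second (physical) address plus a small
 * permission index rather than the permission itself. */
typedef struct {
    uint64_t second_addr;  /* translated address */
    uint8_t  perm_index;   /* index into the permission table */
} translation_t;

/* Per-entity permission tables: the same index can map to different
 * permissions for different entities within the system. */
perm_t perm_table[NUM_ENTITIES][NUM_PERM_INDICES];

perm_t lookup_permission(const translation_t *t, unsigned entity) {
    return perm_table[entity % NUM_ENTITIES]
                     [t->perm_index % NUM_PERM_INDICES];
}
```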
Abstract:
In an embodiment, a coprocessor may include a bypass indication which identifies execution circuitry that is not used by a given coprocessor instruction, and thus may be bypassed. The corresponding circuitry may be disabled during execution, preventing evaluation when the output of the circuitry will not be used for the instruction. In another embodiment, the coprocessor may implement a grid of processing elements in rows and columns, where a given coprocessor instruction may specify an operation that causes up to all of the processing elements to operate on vectors of input operands to produce results. Implementations of the coprocessor may implement a portion of the processing elements. The coprocessor control circuitry may be designed to operate with the full grid or partial grid, reissuing instructions in the partial grid case to perform the requested operation. In still another embodiment, the coprocessor may be able to fuse vector mode operations.
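A minimal C sketch of a bypass indication; the stage names and one-bit-per-stage encoding are illustrative assumptions.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical encoding: one bit per execution stage that a given
 * instruction does not use, so the stage can be disabled while the
 * instruction executes and its unused output is never evaluated. */
enum {
    BYPASS_MULTIPLIER = 1u << 0,
    BYPASS_ADDER      = 1u << 1,
    BYPASS_SHIFTER    = 1u << 2,
};

typedef struct {
    uint32_t opcode;
    uint32_t bypass;  /* bypass indication decoded with the instruction */
} coproc_inst_t;

bool stage_enabled(const coproc_inst_t *inst, uint32_t stage_bit) {
    return (inst->bypass & stage_bit) == 0;  /* bypassed => disabled */
}
```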