Abstract:
An instruction scanning unit for a superscalar microprocessor is disclosed. The instruction scanning unit processes start, end, and functional byte information (or predecode data) associated with a plurality of contiguous instruction bytes. The processing of start byte information and end byte information is performed independently and in parallel, and the instruction scanning unit produces a plurality of scan values which identify valid instructions within the plurality of contiguous instruction bytes. Additionally, the instruction scanning unit is scalable. Multiple instruction scanning units may be operated in parallel to process a larger plurality of contiguous instruction bytes. Furthermore, the instruction scanning unit detects error conditions in the predecode data in parallel with scanning to locate instructions. Moreover, in parallel with the error checking and scanning to locate instructions, MROM instructions are located for dispatch to an MROM unit.
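As a rough illustration, the following Python sketch (function name, bit encoding, and error rules are assumptions, not taken from the patent) locates instructions from per-byte start and end predecode bits and flags inconsistent predecode data. The hardware evaluates the start and end information independently and in parallel; this behavioral model scans sequentially.

    def scan_block(start_bits, end_bits):
        """Return ([(start, end), ...], predecode_error) for one byte block."""
        insns, open_start, error = [], None, False
        for i, (s, e) in enumerate(zip(start_bits, end_bits)):
            if s:
                if open_start is not None:   # new start inside an open insn
                    error = True             # -> inconsistent predecode data
                open_start = i
            if e:
                if open_start is None:       # end bit with no pending start
                    error = True
                else:
                    insns.append((open_start, i))
                    open_start = None
        return insns, error

    print(scan_block([1, 0, 0, 1], [0, 0, 1, 1]))
    # -> ([(0, 2), (3, 3)], False): two valid instructions located
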
Abstract:
A shared branch prediction mechanism is provided in which a pool of branch prediction storage locations is shared among the multiple cache lines comprising a row of the instruction cache. The branch prediction storage locations within the pool are dynamically redistributed among the cache lines according to the number of branch instructions within each cache line. A cache line having a large number of branch instructions may be allocated more branch prediction storage locations than a cache line having fewer branch instructions. A prediction selector is included for each cache line in the instruction cache. The prediction selector indicates the selection of one or more branch prediction storage locations which store branch predictions corresponding to the cache line. In one embodiment, the prediction selector comprises multiple branch selectors. One branch selector is associated with each byte in the cache line, and identifies the branch prediction storage location storing the relevant branch prediction for that byte. In another embodiment, each set of two bytes within a cache line shares a portion of the pool with the corresponding set of two bytes from the other cache lines within the row. The prediction selector for the cache line indicates which sections of the cache line have associated branch prediction storage locations allocated to them, as well as a taken/not-taken prediction associated therewith. The branch prediction selected is the first taken prediction within the line subsequent to the offset indicated by the fetch address.
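A minimal sketch of the per-byte branch selector embodiment follows; the pool contents, line length, and selector encoding are invented for illustration and are not the patent's.

    POOL = [           # shared pool of branch prediction storage locations
        (0x4000, True),    # location 0: (target, taken)
        (0x8000, False),   # location 1: (target, not taken)
    ]

    # One branch selector per byte of a 16-byte cache line; None marks
    # bytes with no branch prediction storage location allocated.
    SELECTORS = [None] * 6 + [0] * 4 + [1] * 6

    def predict(fetch_offset):
        """Look up the branch prediction governing a fetch at this byte."""
        sel = SELECTORS[fetch_offset]
        return POOL[sel] if sel is not None else None

    print(predict(8))   # (16384, True): bytes 6-9 select pool location 0
    print(predict(2))   # None: no prediction allocated for byte 2
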
Abstract:
A superscalar microprocessor is provided employing a way prediction unit which predicts the next fetch address, as well as the way of the instruction cache in which the current fetch address hits, while the instructions associated with the current fetch are being read from the instruction cache. The microprocessor may achieve high frequency operation while using an associative instruction cache. An instruction fetch can be made every clock cycle using the predicted fetch address from the way prediction unit until an incorrect next fetch address or an incorrect way is predicted. The instructions from the predicted way are provided to the instruction processing pipelines of the superscalar microprocessor each clock cycle.
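A behavioral sketch of the predict-and-verify flow, assuming a simple two-way arrangement and invented names; the next-fetch-address prediction the unit also performs is omitted here.

    def fetch(data_ways, tag_ways, predicted_way, index, tag):
        """Read the predicted way; verify the tag as the read completes."""
        insns = data_ways[predicted_way][index]      # speculative read
        if tag_ways[predicted_way][index] == tag:
            return insns, predicted_way              # prediction was correct
        for way in range(len(tag_ways)):             # mispredicted way:
            if tag_ways[way][index] == tag:          # re-check the other ways
                return data_ways[way][index], way    # and retrain predictor
        return None, None                            # miss in all ways

    data_ways = [["line-w0"], ["line-w1"]]
    tag_ways = [[0xA], [0xB]]
    print(fetch(data_ways, tag_ways, predicted_way=0, index=0, tag=0xB))
    # ('line-w1', 1): the way prediction was wrong and is corrected
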
Abstract:
A contention handling apparatus is disclosed which receives access request signals from a number of users and processes these requests to allow controlled access to a shared resource. The contention handling apparatus includes a number of access blocks, with one of the access blocks being associated with each user. A busy line of each of the access blocks is connected to receive a busy signal; the busy signal being an access request signal from a higher priority user, thereby indicating that the shared resource is unavailable. Each access block receiving a busy signal latches the corresponding access request signal until the busy signal is deasserted. If the busy signal and the access request signal occur at the same time, the corresponding access block generates a wait output signal. The logical sum of the wait output of an access block associated with a next higher priority user and the access request signals of all the higher priority users serves as the busy signal for one of the access blocks.
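A minimal single-cycle model of the busy/wait chain follows; the signal names are mine, and the latching of requests across cycles until busy deasserts is omitted.

    def evaluate(requests):
        """requests[0] is highest priority; returns (grants, waits)."""
        grants, waits = [], []
        for i, req in enumerate(requests):
            # busy = logical sum of all higher-priority request signals
            # and the wait output of the next-higher-priority access block
            busy = any(requests[:i]) or (waits[i - 1] if i else False)
            waits.append(req and busy)       # simultaneous request + busy
            grants.append(req and not busy)  # unblocked requester proceeds
        return grants, waits

    print(evaluate([False, True, True]))
    # ([False, True, False], [False, False, True]): user 1 is granted
    # the resource; user 2 asserts wait until it frees up
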
Abstract:
A method and apparatus are disclosed for reducing the propagation delay associated with the critical speed path of a binary logic circuit by using "multiplexing logic". More specifically, the inputs to the logic circuit are defined as either critical or non-critical inputs and the product terms are manipulated so that the non-critical inputs are mutually exclusive. The non-critical inputs are supplied to one or more first logic gate structures wherein the ultimate outputs of the first logic gate structures control multiplexer couplers. The critical speed inputs are supplied to one or more second logic gate structures wherein the ultimate outputs of the second logic gate structures are provided as the input to the multiplexer couplers.
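A toy Python illustration, under the assumption that a and b are the early, mutually exclusive non-critical inputs and c and d the late, critical ones; it checks that routing the late inputs through only a final multiplexer stage preserves the original sum-of-products.

    def f_direct(a, b, c, d):
        return (a and c) or (b and d)          # late c, d gate the AND-OR tree

    def f_mux(a, b, c, d):
        sel = b                                # computed from early inputs only
        return (d if sel else c) and (a or b)  # c, d see only the final mux

    for bits in range(16):
        a, b, c, d = (bits >> i & 1 for i in range(4))
        if a and b:
            continue                           # a, b assumed mutually exclusive
        assert f_direct(a, b, c, d) == f_mux(a, b, c, d)
    print("equivalent for all mutually exclusive a, b")
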
Abstract:
A technique for operating a processor includes translating, using an associated translation lookaside buffer, a first virtual address into a first physical address through a first entry number in the translation lookaside buffer. The technique also includes translating, using the translation lookaside buffer, a second virtual address into a second physical address through a second entry number in the translation lookaside buffer. The technique further includes, in response to the first entry number being the same as the second entry number, determining that the first and second virtual addresses point to the same physical address in memory and reference the same data.
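A hedged sketch of the entry-number comparison, assuming a small fully associative TLB with an invented field layout; the two accesses (say, an earlier store and a later load translating the same virtual address) are an assumed usage, not specified by the abstract.

    PAGE_SIZE = 4096

    def translate(tlb, vaddr):
        """Return (paddr, entry_number) on a TLB hit, else (None, None)."""
        vpn, offset = divmod(vaddr, PAGE_SIZE)
        for entry, (tag, ppn) in enumerate(tlb):
            if tag == vpn:
                return ppn * PAGE_SIZE + offset, entry
        return None, None

    tlb = [(0x10, 0x80), (0x20, 0x90)]
    _, e1 = translate(tlb, 0x10123)   # e.g. an earlier store's translation
    _, e2 = translate(tlb, 0x10123)   # e.g. a later load's translation
    print(e1 == e2)   # True: equal (narrow) entry numbers prove both
                      # accesses reference the same data, with no need to
                      # compare the full physical addresses
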
Abstract:
In some embodiments, a data processing system includes a processing unit, a first load/store unit (LSU), and a second LSU configured to operate independently of the first LSU in single-thread and multi-thread modes. A first store buffer is coupled to the first and second LSUs, and a second store buffer is coupled to the first and second LSUs. The first store buffer is used to execute a first thread in multi-thread mode. The second store buffer is used to execute a second thread in multi-thread mode. Both the first and second store buffers are used when executing a single thread in single-thread mode.
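A minimal sketch of the buffer selection; the mode and buffer names are assumed for illustration.

    def store_buffers_for(mode, thread_id):
        """Select which store buffer(s) a thread's stores may use."""
        if mode == "multi":
            return ["SB0"] if thread_id == 0 else ["SB1"]
        return ["SB0", "SB1"]            # a single thread gets both buffers

    print(store_buffers_for("multi", 1))    # ['SB1']
    print(store_buffers_for("single", 0))   # ['SB0', 'SB1']
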
Abstract:
A processor reduces the likelihood of stalls in its instruction pipeline by dynamically extending the size of a full execution queue. To extend the full execution queue, the processor temporarily repurposes another execution queue to store instructions on behalf of the full execution queue. The execution queue to be repurposed can be selected based on a number of factors, including the type of instructions it is generally designated to store, whether it is empty of other instruction types, and the rate of cache hits at the processor. By selecting the repurposed queue based on dynamic factors such as the cache hit rate, the likelihood of stalls at the dispatch stage is reduced for different types of program flows, improving overall efficiency of the processor.
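An illustrative donor-selection policy only; the queue kinds, threshold, and eligibility rules below are assumptions, not the patent's.

    from dataclasses import dataclass, field

    @dataclass
    class ExecQueue:
        kind: str                      # instruction type it normally holds
        entries: list = field(default_factory=list)

    def pick_donor(full_queue, queues, cache_hit_rate, threshold=0.9):
        """Choose a queue to temporarily extend the full queue into."""
        for q in queues:
            if q is full_queue or q.entries:
                continue                   # must be empty of other types
            if q.kind == "load" and cache_hit_rate < threshold:
                continue                   # loads would soon need it back
            return q                       # repurpose this queue
        return None                        # no donor: dispatch may stall

    queues = [ExecQueue("int"), ExecQueue("load"), ExecQueue("fp")]
    print(pick_donor(queues[0], queues, cache_hit_rate=0.5).kind)  # 'fp'
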
Abstract:
A system and method for power efficient memory caching. Some illustrative embodiments may include a system comprising: a hash address generator coupled to an address bus (the hash address generator converts a bus address present on the address bus into a current hashed address); a cache memory coupled to the address bus (the cache memory comprises a tag stored in one of a plurality of tag cache ways and data stored in one of a plurality of data cache ways); and a hash memory coupled to the address bus (the hash memory comprises a saved hashed address, the saved hashed address associated with the data and the tag). Less than all of the plurality of tag cache ways are enabled when the current hashed address matches the saved hashed address. An enabled tag cache way comprises the tag.
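A sketch of the way-enable decision; the hash function, hash width, and way count are invented for illustration (the abstract does not specify them).

    def hash_addr(addr):
        return (addr ^ (addr >> 7)) & 0x1F       # 5-bit hash, illustrative

    def ways_to_enable(addr, saved_hashes):
        """Enable only tag ways whose saved hash matches the current hash."""
        h = hash_addr(addr)
        return [w for w, saved in enumerate(saved_hashes) if saved == h]

    saved = [0x03, hash_addr(0xBEEF00), 0x1C, 0x0A]   # one hash per way
    print(ways_to_enable(0xBEEF00, saved))
    # [1]: only the matching tag way is powered up for the compare
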
Abstract:
An instruction cache employing a cache holding register is provided. When a cache line of instruction bytes is fetched from main memory, the instruction bytes are temporarily stored into the cache holding register as they are received from main memory. The instruction bytes are predecoded as they are received from the main memory. If a predicted-taken branch instruction is encountered, the instruction fetch mechanism within the instruction cache begins fetching instructions from the target instruction path. This fetching may be initiated prior to receiving the complete cache line containing the predicted-taken branch instruction. As long as instruction fetches from the target instruction path continue to hit in the instruction cache, these instructions may be fetched and dispatched into a microprocessor employing the instruction cache. The remaining portion of the cache line of instruction bytes containing the predicted-taken branch instruction is received by the cache holding register. In order to reduce the number of ports on the instruction bytes storage used to store cache lines of instructions, the cache holding register retains the cache line until an idle cycle occurs in the instruction bytes storage. The same port ordinarily used for fetching instructions is then used to store the cache line into the instruction bytes storage. In one embodiment, the instruction cache prefetches the cache line succeeding the cache line which misses. A second cache holding register is employed for storing the prefetched cache line.
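A behavioral sketch of the holding-register handoff with an invented interface; the second holding register used for prefetched lines is omitted.

    class HoldingRegister:
        def __init__(self):
            self.line = None
            self.tag = None

        def fill(self, tag, line_bytes):
            """Capture a cache line as it arrives from main memory."""
            self.tag, self.line = tag, line_bytes

        def try_writeback(self, storage, port_busy_this_cycle):
            """Write the held line through the ordinary fetch port, but
            only on an idle cycle, so no extra write port is needed."""
            if self.line is not None and not port_busy_this_cycle:
                storage[self.tag] = self.line
                self.line = self.tag = None
                return True
            return False

    storage = {}
    hr = HoldingRegister()
    hr.fill(tag=0x40, line_bytes=b"\x90" * 32)
    hr.try_writeback(storage, port_busy_this_cycle=True)   # fetch in progress
    hr.try_writeback(storage, port_busy_this_cycle=False)  # idle: write occurs
    print(0x40 in storage)   # True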