摘要:
The invention is directed to a cache memory system in a data processor including a virtual cache memory, a physical cache memory, a virtual to physical translation buffer, a physical to virtual backmap, an Old-PA pointer and a lockout register. The backmap implements invalidates by clearing the valid flags in virtual cache memory. The Old-PA pointer indicates the backmap entry to be invalidated after a reference misses in the virtual cache. The physical address for data written to virtual cache memory is entered to Old-PA pointer by the translation buffer. The lockout register arrests all references to data which may have synonyms in virtual cache memory. The backmap is also used to invalidate any synonyms.
摘要:
Methods and apparatus relating to a hardware move elimination and/or next page prefetching are described. In some embodiments, a logic may provide hardware move eliminations based on stored data. In an embodiment, a next page prefetcher is disclosed. Other embodiments are also described and claimed.
摘要:
A processor includes a cache memory with a data storage unit operating at a first clock frequency, and a tag unit and hit/miss logic operating at a second clock frequency different than the first clock frequency. The data storage unit may advantageously be clocked faster than the tag unit and hit/miss logic, such as two times (2×) faster. The processor may also include a replay mechanism for recovering from data speculation when the hit/miss logic or the tag unit signals that speculated data from the higher clocked data storage unit is, in fact, invalid.
摘要:
The present invention provides a method and apparatus for controlling a processing priority assigned alternately to a first thread and a second thread in a multithreaded processor to prevent deadlock and livelock problems between the first thread and the second thread. In one embodiment, the processing priority is initially assigned to the first thread for a first duration. It is then determined whether the first duration has expired in a given processing cycle. If the first duration has expired, the processing priority is assigned to the second thread for a second duration.
摘要:
A mechanism for executing computer instructions in parallel includes a compiler for generating and grouping instructions into a plurality of sets of instructions to be executed in parallel, each set having a unique identification. A computer system having a real state and a speculative state executes the sets in parallel, the computer system executing a particular set of instructions in the speculative state if the instructions of the particular set have dependencies which can not be resolved until the instructions are actually executed. The computer system generates speculative data while executing instructions in the speculative state. Logic circuits are provided to detect any exception conditions which occur while executing the particular set in the speculative state. If the particular set is subject to an exception condition, the instructions of the set are re-executed to resolve the exception condition, and to incorporate the speculative data in the real state of the computer system.
摘要:
A microprocessor having a replay architecture with an execution core for performing data speculation in executing an instruction, a delay unit for making a copy of the instruction and holding the copy for as long as the instruction takes to execute, and a checker for determining whether the data speculation was bogus. If the data speculation was bogus, the delay unit and its buffer send the copy of the instruction back to the execution core for re-execution. A multiplexor coupled to the input of the execution core selects for execution among original instructions from the instruction cache, replay instructions from the delay unit, and manufactured instructions from various other units such as the TLB or tag units, according to a priority scheme.
摘要:
A method and device for receiving data in a synchronous communication system. Data can be accurately transferred between two subsystems in a synchronous system even where the clock skew and propagation delay between the two subsystems is unlimited. The receiving subsystem is initialized to ensure synchronous data transfer over a theoretically infinite range. The transmitting subsystem transmits data and a forwarded clock to the receiving subsystem. Data is captured in three state devices arranged in parallel to eliminate minimum delay requirements and to expand data valid time. The captured data is then aligned to the clock of the receiving subsystem by controlling a multiplexer which selects the proper state device output to pass to another state device for alignment to the receiving subsystem's clock. The multiplexer is controlled by a circuit which monitors the capturing of the incoming data and determines the correct state device output to select for proper data alignment.
摘要:
A cache memory system in a data processor that has a main memory and a processing unit, the cache memory system including a virtually addressed storage cache. This virtually addressed storage cache is connected to the main memory for storing in storage cache locations preselected portions of data from the main memory. Each cache location includes a valid indicator to indicate the data in the cache location is current. A translation buffers is coupled to the storage cache, and translates a virtual address to a physical address. The backmap is coupled to the storage cache and the translation buffer, and invalidates data in the storage cache by generating an invalidate index to the cache location at which a valid indicator is to be cleared only when data in the storage cache is to be invalidated. The cache memory system includes at least one lockout register for storing addresses for data which may exist in more than one storage location. The backmap invalidates all copies of the data in the storage cache after every reference to data in the storage cache using an address in the lockout register.
摘要:
Data can be accurately transmitted between two subsystems even if the clock skew between the two subsystems is larger than one clock cycle by the method of the invention. In one embodiment data is loaded into N state devices in the sending subsystem while the receiver recovers data from the sending state devices in rotation with an N input multiplexer. Another embodiment forwards a clock signal from the sending subsystem along with a data vector of N state signals for recovery by a pair of state devices capturing data on the rising and falling edges of the forwarded clock. A further embodiment achieves double bandwidth by forwarding two clock signals.
摘要:
Systems, apparatuses, and methods for a hardware and software system to automatically decompose a program into multiple parallel threads are described. For example, a method according to one embodiment comprises: analyzing a single-threaded region of executing program code, the analysis including identifying dependencies within the single-threaded region; determining portions of the single-threaded region of executing program code which may be executed in parallel based on the analysis; assigning the portions to two or more parallel execution tracks; and executing the portions in parallel across the assigned execution tracks.