摘要:
This invention simulates program to create a conflict graph of the cache accesses. The conflict graph is used to relay out relocatable functions to minimize cache conflict misses where conflicting functions map to the same portion of the cache. The conflict graph includes a vertex for each function and an edge between functions having a weight corresponding to a conflict amount. This conflict graph enables a layout of functions to minimize the number of conflicting items that map to the same location in the cache weighted by the degree of conflict encoded by the edges in the graph.
摘要:
A method and apparatus are disclosed for allocating functional units in a multithreaded very large instruction word (VLIW) processor. The present invention combines the techniques of conventional very long instruction word architectures and conventional multithreaded architectures to reduce execution time within an individual program, as well as across a workload. The present invention utilizes instruction packet splitting to recover some efficiency lost with conventional multithreaded architectures. Instruction packet splitting allows an instruction bundle to be partially issued in one cycle, with the remainder of the bundle issued during a subsequent cycle. The allocation hardware assigns as many instructions from each packet as will fit on the available functional units, rather than allocating all instructions in an instruction packet at one time. Those instructions that cannot be allocated to a functional unit are retained in a ready-to-run register. On subsequent cycles, instruction packets in which all instructions have been issued to functional units are updated from their thread's instruction stream, while instruction packets with instructions that have been held are retained. The functional unit allocation logic can then assign instructions from the newly-loaded instruction packets as well as instructions that were not issued from the retained instruction packets.
摘要:
A method and apparatus are disclosed for allocating functional units in a multithreaded very large instruction word (VLIW) processor. The present invention combines the techniques of conventional very long instruction word (VLIW) architectures and conventional multithreaded architectures to reduce execution time within an individual program, as well as across a workload. The present invention utilizes instruction packet splitting to recover some efficiency lost with conventional multithreaded architectures. Instruction packet splitting allows an instruction bundle to be partially issued in one cycle, with the remainder of the bundle issued during a subsequent cycle. There are times, however, when instruction packets cannot be split without violating the semantics of the instruction packet assembled by the compiler. A packet split identification bit is disclosed that allows hardware to efficiently determine when it is permissible to split an instruction packet. The split bit informs the hardware when splitting is prohibited. The allocation hardware assigns as many instructions from each packet as will fit on the available functional units, rather than allocating all instructions in an instruction packet at one time, provided the split bit has not been set. Those instructions that cannot be allocated to a functional units are retained in a ready-to-run register. On subsequent cycles, instruction packets in which all instructions have been issued to functional units are updated from their thread's instruction stream, while instruction packets with instructions that have been held are retained. The functional unit allocation logic can then assign instructions from the newly-loaded instruction packets as well as instructions that were not issued from the retained instruction packets.
摘要:
In general, and in a form of the present invention, a method is provided for reducing execution time of a program executed on a digital system by improving hit rate in a cache of the digital system. This is done by determining cache performance during execution of the program over a period of time as a function of address locality, and then identifying occurrences of cache conflict between two program modules. One of the conflicting program modules is then relocated so that cache conflict is eliminated or at least reduced. In one embodiment of the invention, a 2D plot of cache operation is provided as a function of address versus time for the period of time. A set of cache misses having temporal locality and spatial locality is identified as a horizontally arranged grouping of pixels at a certain address locality having a selected color indicative of a cache miss. Cache conflict is determined by overlying an interference grid or a shadow grid on the plot responsive to the address locality such that a plurality of lines are displayed at other address localities that map to the same region in cache as the first address locality. In order to relocate a program module, a relocation parameter is provided to a linker to cause the program module to be linked at a different address.
摘要:
A method and apparatus are disclosed for allocating functional units in a multithreaded very large instruction word (VLIW) processor. The present invention combines the techniques of conventional VLIW architectures and conventional multithreaded architectures to reduce execution time within an individual program, as well as across a workload. The present invention utilizes a compiler to detect parallelism. The disclosed multithreaded VLIW architecture exploits program parallelism by issuing multiple instructions, in a similar manner to single threaded VLIW processors, from a single program sequencer, and also supports multiple program sequencers, as in simultaneous multithreading. Instructions are allocated to functional units to issue multiple VLIW instructions to multiple functional units in the same cycle. The allocation mechanism of the present invention occupies a pipeline stage just before arguments are dispatched to functional units. The allocate stage determines how to group the instructions together to maximize efficiency, by selecting appropriate instructions and assigning the instructions to the FUs. The criteria for selection are thread priority or resource availability or both. Under the thread priority criteria, different threads can have different priorities. The allocate stage selects and forwards the packets (or instructions from packets) for execution belonging to the thread with the highest priority according to the priority policy implemented. Under the resource availability criteria, a packet (having up to K instructions) can be allocated only if the resources (functional units) required by the packet are available for the next cycle. Functional units report their availability to the allocate stage.
摘要:
A method and apparatus are disclosed for releasing functional units in a multithreaded very large instruction word (VLIW) processor. The functional unit release mechanism can retrieve the capacity lost due to multiple cycle instructions. The functional unit release mechanism of the present invention permits idle functional units to be reallocated to other threads, thereby improving workload efficiency. Instruction packets are assigned to functional units, which can maintain their state, independent of the issue logic. Each functional unit has an associated state machine (SM) that keeps track of the number of cycles that the functional unit will be occupied by a multiple-cycle instruction. Functional units do not reassign themselves as long as the functional unit is busy. When the instruction is complete, the functional unit can participate in functional unit allocation, even if other functional units assigned to the same thread are still busy. The functional unit release approach of the present invention allows the functional units that are not associated with a multiple-cycle instruction to be allocated to other threads while the blocked thread is waiting, thereby improving throughput of the multithreaded VLIW processor. Since the state is associated with each functional unit separately from the instruction issue unit, the functional units can be assigned to threads independently of the state of any one thread and its constituent instructions.
摘要:
An apparatus and method in digital processing provides a simple and efficient way of communicating parameters from a parent thread to child thread with two step thread creation. The method comprising the steps of: allocating hardware context for the child thread; enabling the parent thread to execute other instructions wherein parent thread register writes update both parent and child architectural registers; and spawning the child thread. In essence, the parent thread sends parameters to the child by writing to the parent's registers prior to spawning of the child thread.
摘要:
A history-based prefetch cache which includes a time queue. The time queue correlates past events with cache misses in a microprocessor. The time queue is set to N cycles, N being a predetermined, arbitrary or programmable amount. The prefetch cache is a prefetch target buffer which receives inputs from a time queue and a cache and determines if an event is present in the cache. If an address is not present in the cache it is prefetched based on past events and inserted into the prefetch target buffer so that the microprocessor will not miss it the next time.