Abstract:
A method and a system in a data processing system for managing registers in a register array wherein the data processing system has M architected registers and the register array has greater than M registers. A first physical register address is selected from a group of available physical register addresses in a renamed table in response to dispatching a register-modifying instruction that specifies an architected target register address. The architected target register address is then associated with the first physical register address, and a result of executing the register-modifying instruction is stored in a physical register pointed to by the first physical register address. In response to completing the register-modifying instruction, the first physical address in the rename table is exchanged with a second physical address in a completion renamed table, wherein the second physical address is located in the completion rename table at a location pointed to by the architected target register address. Therefore, upon instruction completion, the completion rename table contains pointers that map architected register addresses to physical register addresses. The rename table maps architected register addresses to physical register addresses for instructions currently being executed, or for instructions that have “finished” and have not yet been “completed.” Bits indicating the validity of an association between an architected register address and a physical register address are stored before instructions are speculatively executed following an unresolved conditional branch.
Abstract:
A computer processing unit is provided that includes an apparatus for generating an address of the next instruction to be completed. The apparatus includes a first table for storing a plurality of entries each corresponding to a dispatched instruction, each entry comprising an identifier that identifies the corresponding instruction and a status bit that indicates if the corresponding instruction is completed; a second table for storing a plurality of entries each corresponding to dispatched branch instructions, each entry comprising the same identifier stored in the first table, a target address of the dispatched branch instruction and a resolution status field that indicates at least if the corresponding branch instruction has been resolved taken or has been resolved not taken; program counter update logic that, in each machine cycle, updates a program counter to store and output the address of the next instruction to be completed according to the entries stored in the first table and the second table. Because the first and second tables employ efficient identification tags to identify instructions that modify the control flow of the execution pipeline and the target address of such instructions, the computer processing unit of the present invention need not store the full address of each instruction in the execution pipeline to update the program counter as is conventional, and thus saves real estate that may be used for other circuitry.
Abstract:
A fast adder/subtracter using a decoder and shifting function instead of conventional full-adders is disclosed. The circuit is optimized for the addition of multiple operands up to 4-5 binary bits in magnitude. Using this method a subtraction operation can be performed at no added cost with respect to addition (compared to the conventional method requiring complementing one of the operands). Addition and subtraction of multiple operands is implemented by simple multiple shift operations. The multiple shift operations can be implemented as a chain of series NMOS pulldown devices with a precharged load providing considerable speed advantage over conventional solutions. Fast overflow detection may be implemented by or-ing the higher order bits in the shifter.
Abstract:
A data processing apparatus for simultaneously reading out groups of two or more contiguous, variable length instructions from memory, and for decoding the group of variable length instructions in parallel. The data processing apparatus has a memory containing at least first, second, and third contiguous instructions, and at least first, second, and third read ports for receiving starting addresses and for reading out the instructions from the memory. A next instruction pointer supplies the starting address of the first instruction to the first read port, receives the first instruction, decodes the length of the first instruction, determines the starting address of the second instruction, supplies the starting address of the second instruction to the first read port, receives the second instruction, decodes the length of the second instruction, and determines the starting address of the third instruction. All of these operations are performed in one cycle time. An instruction pointer queue receives and stores the starting addresses of at least the second and third instructions, and supplies the starting addresses to the second and third read ports for simultaneously reading out the second and third instructions from the memory. First and second instruction decoders receive and simultaneously decode the second and third instructions.
Abstract:
A processor is defined by a new architectural feature called a Backing Register File, where a Backing Register File is a set of randomly accessible registers capable of holding values, and further are directly connected to the processor's register files. The processor's register files are in turn connected to the processor's execution units. A Backing Register File is visible and controllable by users, allowing them to make use of a larger local address space increasing execution unit throughput thereby, while not changing the size of the processor's register files themselves.
Abstract:
A central processing unit (CPU) in a computer that permits speculative parallel execution of more than one instruction thread. The CPU uses Fork-Suspend instructions that are added to the instruction set of the CPU, and are inserted in a program prior to run-time to delineate potential future threads for parallel execution. The CPU has an instruction cache with one or more instruction cache ports, a bank of one or more program counters, a bank of one or more dispatchers, a thread management unit that handles inter-thread communications and discards future threads that violate dependencies, a set of architectural registers common to all threads, and a scheduler that schedules parallel execution of the instructions on one or more functional units in the CPU.
Abstract:
A processor comprising a new architectural feature called a Register Domain, where a Register Domain has a register file, at least one execution unit, and coupling circuitry between the two. A processor will typically have a plurality of Register Domains, and Register Domains may have different types of execution units within them. Individual Register Domains will be visible to a user.
Abstract:
A memory access scheme achieved using a memory address register and a register-indirect memory accessing mode eliminates write back collisions, long cycle time, and enhances system performance. During memory address generation operations, an arithmetic-logic unit (ALU) generates memory addresses from data in a general purpose register (GPR). Then, the memory addresses are written back to the GPR and a memory address register (MAR). During memory access operations, the MAR is accessed for the memory addresses to access a memory. Two approaches are provided. In a first approach, use of the MAR during the memory access operations is explicit. In a second approach, use of the MAR during the memory access operations is transparent. According to the second approach, a controller is provided to validate the MAR during the memory access operations.
Abstract:
An on-chip VLSI cache architecture including a single-port, last-select, cache array organized as an n-way set-associative cache (having n congruence classes) including a plurality of functionally integrated units on-chip in addition to the cache array and including a normal read/write CPU access function which provides an architectural organization for allowing the chip to be used in (1) a fast, "late-select" operation which may be provided with any desired degree of set-associativity while achieving an effective one-cycle write operation, and (2) a cache reload function which provides a highly parallel store-back and reload operation to substantially reduce the reload time, particularly for a store-in cache organization. The cache chip organization and architecture provide a late-select cache having a nearly transparent, multiple word reload by incorporating a Cache-Reload Buffer, a store-back buffer and a load-through function all included on the cache array chip for reloading, and a delayed write-enable for achieving an effective one-cycle write operation. Two separate decoder functions are integrated on the chip, one for cache access for normal read/write operations to and from the CPU and one for cache reload which also provides interim access to data which has been transferred out of main memory to the chip but not yet reloaded into the cache array. These two decoders provide for different accessing modes as required of the CPU or main memory operations.
Abstract:
Performance of a VLSI processor of the reduced instruction set computer (RISC) type is enhanced by executing two instructions simultaneously in the two execution units of the processor. There is very little increase in the cost of hardware. Three embodiments are presented with different cost and performance capabilities. The first embodiment has an instruction input to an instruction buffer (10) and two sets of control ROSs (40 and 42) and control registers (64 and 65). The control ROS and control register which is chosen depends on which instruction execution unit is to execute the instruction. Data inputs to the execution units is from a register file (48) which has an additional pair of outputs (51) and (53) that provide the data paths for simultaneous execution of instructions by the execution units. Execution unit I has an arithmetic and logic unit (ALU) (24), while execution unit II has a rotate (26) and mask generator (31). Load balancing between the two execution units can be performed by adding a multiplier (60) and divider (62) to execution unit II. In the second embodiment, additionally, load balancing is achieved by incorporating an adder (78) into execution unit II. The adder (78) is used to perform address calculations to speed up the load, store and branch instructions. In the third embodiment, an additional ALU (90) is added to execution unit II to allow the instruction processing to be further balanced between the two execution units.