Abstract:
A method and apparatus for out-of-order processing of packets using linked lists is described. In one embodiment, the method includes receiving packets in a global order, the packets being designated for different ones of a plurality of reorder contexts. The method also includes storing information regarding each of the packets in a shared reorder buffer. The method also includes, for each of the plurality of reorder contexts, maintaining a reorder context linked list that records the order in which those of the packets that were designated for that reorder context and that are currently stored in the shared reorder buffer were received relative to the global order. The method also includes completing processing of at least certain of the packets out of the global order and retiring at least certain of the packets from the shared reorder buffer out of the global order. These and other aspects of the present invention will be better described with reference to the Detailed Description and the accompanying figures.
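The per-context linked-list bookkeeping described above can be illustrated with a small behavioral sketch. This is only a model under assumed names (SharedReorderBuffer, receive, complete, retire); the patent does not specify this implementation.

```python
# Behavioral sketch: a shared reorder buffer plus one linked list per reorder
# context. All class and method names are illustrative assumptions.

class BufferEntry:
    def __init__(self, seq, packet):
        self.seq = seq      # position in the global arrival order
        self.packet = packet
        self.done = False   # set when processing completes (possibly out of order)
        self.next = None    # link to the next entry of the same reorder context

class SharedReorderBuffer:
    def __init__(self):
        self.heads = {}     # reorder context -> oldest unretired entry
        self.tails = {}     # reorder context -> most recently received entry
        self.seq = 0        # global arrival counter

    def receive(self, context, packet):
        entry = BufferEntry(self.seq, packet)
        self.seq += 1
        if context in self.tails:            # append to that context's linked list
            self.tails[context].next = entry
        else:
            self.heads[context] = entry
        self.tails[context] = entry
        return entry

    def complete(self, entry):
        entry.done = True   # processing may finish out of the global order

    def retire(self, context):
        """Retire finished packets from the head of one context's list.
        Contexts retire independently, so retirement need not follow the
        global arrival order."""
        retired = []
        head = self.heads.get(context)
        while head is not None and head.done:
            retired.append(head.packet)
            head = head.next
        if head is None:
            self.heads.pop(context, None)
            self.tails.pop(context, None)
        else:
            self.heads[context] = head
        return retired

buf = SharedReorderBuffer()
a = buf.receive("ctx0", "p0"); b = buf.receive("ctx1", "p1"); c = buf.receive("ctx0", "p2")
buf.complete(c); buf.complete(a)
print(buf.retire("ctx0"))   # ['p0', 'p2'] retired before ctx1's earlier packet
```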
Abstract:
A microprocessor includes a first register file including a plurality of multimedia registers defined to store operands for multimedia instructions and a second register file including a plurality of floating point registers defined to store operands for floating point instructions. The multimedia registers and floating point registers are mapped to the same logical storage according to the instruction set employed by the microprocessor. In order to maintain predefined behavior when a floating point instruction reads a register most recently updated by a multimedia instruction or vice versa, the microprocessor provides for synchronization of the first and second register files between executing a set of one or more multimedia instructions and a set of one or more floating point instructions (where either set may precede the other in program order, and that order determines the direction in which contents are copied, i.e. from the first register file to the second or vice versa). The predefined behavior in the above-mentioned circumstances is thereby maintained. The microprocessor supports an empty state instruction. If the empty state instruction is included between the set of one or more multimedia instructions and the set of one or more floating point instructions in a code sequence, the microprocessor inhibits the register file synchronization. In one embodiment including the x86 instruction set, the empty state instruction performs the same set of actions as the EMMS instruction in addition to the above-mentioned features.
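A minimal behavioral sketch of the described synchronization, assuming illustrative names (AliasedRegisterFiles, empty_state) and an eight-register file; this is only a model of when contents are copied and when the empty state instruction inhibits the copy, not the microprocessor's actual mechanism.

```python
# Model the two register files as lists and track which instruction type
# wrote most recently; a switch triggers a copy unless the state is "empty".

class AliasedRegisterFiles:
    def __init__(self, n=8):
        self.mm = [0] * n          # multimedia register file
        self.fp = [0] * n          # floating point register file
        self.last_writer = None    # "mm", "fp", or None (empty/clean state)

    def _sync(self, about_to_use):
        # Copy contents so both files reflect the same logical storage, but
        # only when the other instruction type wrote most recently and no
        # empty state instruction intervened.
        if self.last_writer == "mm" and about_to_use == "fp":
            self.fp = list(self.mm)
        elif self.last_writer == "fp" and about_to_use == "mm":
            self.mm = list(self.fp)

    def mm_write(self, i, value):
        self._sync("mm")
        self.mm[i] = value
        self.last_writer = "mm"

    def fp_read(self, i):
        self._sync("fp")
        return self.fp[i]

    def empty_state(self):
        # Analogous to the EMMS-like instruction: marks the multimedia state
        # empty so the next switch skips the copy.
        self.last_writer = None

regs = AliasedRegisterFiles()
regs.mm_write(3, 0x1234)
print(hex(regs.fp_read(3)))   # 0x1234: the FP view sees the multimedia update
regs.mm_write(4, 0x5678)
regs.empty_state()            # inhibits the copy on the next switch
print(hex(regs.fp_read(4)))   # 0x0: synchronization was skipped
```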
Abstract:
A multimedia extension unit (MEU) is provided for performing various multimedia-type operations. The MEU may be coupled either through a coprocessor bus or a local CPU bus to a conventional processor. The MEU employs vector registers, a vector ALU, and an operand routing unit (ORU) to perform a maximum number of the multimedia operations within as few instruction cycles as possible. Complex algorithms are readily performed by arranging operands upon the vector ALU in accordance with the desired algorithm flowgraph. The ORU aligns the operands within partitioned slots or sub-slots of the vector registers using vector instructions unique to the MEU. At the output of the ORU, operand pairs from vector source or destination registers may be easily routed and combined at the vector ALU. The vector instructions employ special load/store instructions in combination with numerous operational instructions to carry out concurrent multimedia operations on the aligned operands. In one embodiment, an arithmetic logic unit may be partitioned into at least two logic portions. A first logic portion may be coupled to receive a first operand from a fixed slot of a first register and a second operand from any slot of a second register. A second logic portion may be coupled to receive a third operand from a fixed slot of the first register and a fourth operand from any slot of the second register. The first logic portion may perform an arithmetic operation dissimilar from the second logic portion.
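The fixed-slot/any-slot routing and the dissimilar operations per ALU partition can be sketched as follows. The slot count, routing vector, and choice of add/subtract are illustrative assumptions; the pattern mirrors the butterfly stages found in transform flowgraphs.

```python
import operator

def routed_partitioned_op(reg_a, reg_b, routing, ops):
    """reg_a, reg_b: vector registers as lists of slots.
    routing[i]: which slot of reg_b feeds partition i (the reg_a slot is fixed).
    ops[i]: the (possibly dissimilar) operation applied by partition i."""
    return [ops[i](reg_a[i], reg_b[routing[i]]) for i in range(len(reg_a))]

a = [10, 20, 30, 40]            # fixed-slot operands
b = [1, 2, 3, 4]                # freely routed operands
routing = [3, 2, 1, 0]          # e.g. reverse the slots of b
ops = [operator.add, operator.add, operator.sub, operator.sub]

# Partitions 0-1 add while partitions 2-3 subtract, a butterfly-style pattern.
print(routed_partitioned_op(a, b, routing, ops))   # [14, 23, 28, 39]
```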
Abstract:
A multimedia extension unit (MEU) is provided for performing various multimedia-type operations. The MEU can be coupled either through a coprocessor bus or a local CPU bus to a conventional processor. The MEU employs vector registers, a vector ALU, and an operand routing unit (ORU) to perform a maximum number of the multimedia operations within as few instruction cycles as possible. Complex algorithms are readily performed by arranging operands upon the vector ALU in accordance with the desired algorithm flowgraph. The ORU aligns the operands within partitioned slots or sub-slots of the vector registers using vector instructions unique to the MEU. At the output of the ORU, operand pairs from vector source or destination registers can be easily routed and combined at the vector ALU. The vector instructions employ special load/store instructions in combination with numerous operational instructions to carry out concurrent multimedia operations on the aligned operands. In one embodiment multiple ALUs may each receive one operand from a fixed source register slot location, where the fixed slot location may be different for each ALU. The operand routing may provide another operand from any source register slot location for another input to each respective ALU.
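As a short companion to the sketch above, this variant's per-ALU fixed slot plus freely routed second operand can be modeled as a broadcast of one slot to every lane, e.g., scaling a vector of pixels by a single coefficient. Names and slot counts are again illustrative.

```python
def route_and_multiply(reg_a, reg_b, routing):
    # Lane i always reads reg_a[i] (its fixed slot); reg_b's slot is chosen
    # per lane by the routing field.
    return [reg_a[i] * reg_b[routing[i]] for i in range(len(reg_a))]

pixels = [8, 16, 24, 32]
coeffs = [3, 0, 0, 0]           # slot 0 holds the scale factor
print(route_and_multiply(pixels, coeffs, routing=[0, 0, 0, 0]))   # [24, 48, 72, 96]
```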
Abstract:
An instruction decoder includes an emulation code sequencer and emulation code ROM for handling various instructions. The emulation code ROM includes a sequence of operations (Op) and an operation sequencing control code (OpSeq). Branch instructions such as conditional branch instructions may be encoded into the emulation code ROM so that a second branch, in combination with the branching operation controlled by the OpSeq code, is applied to an operation code sequence. Two-way branching permits flexible branching to locations within the emulation code ROM so that memory capacity is conserved. A superscalar microprocessor includes an instruction decoder having an emulation code control circuit and an emulation ROM which emulates the function of a logic instruction decoder. The emulation code ROM is arranged as a matrix of multiple-operation (Op) units with each multiple-Op unit including a control field that points to a next location in the emulation code ROM. In one embodiment, the emulation code ROM is arranged to include a plurality of four-Op units, called Op quads, with each Op quad including a sequencing control field, called an OpSeq field.
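The two-way branching can be sketched with a small interpreter: each Op quad carries an OpSeq field naming the default next location, while a conditional branch Op inside the quad may supply a second target. The ROM contents, Op encodings, and condition below are invented for illustration.

```python
def run_emulation_rom(rom, start, state):
    addr = start
    trace = []
    while addr is not None:
        ops, opseq_next = rom[addr]
        next_addr = opseq_next          # default: follow the OpSeq field
        for op in ops:
            kind = op[0]
            if kind == "add":
                _, reg, value = op
                state[reg] = state.get(reg, 0) + value
            elif kind == "brcond":      # second, conditional branch target
                _, reg, target = op
                if state.get(reg, 0) != 0:
                    next_addr = target
        trace.append(addr)
        addr = next_addr
    return trace

rom = {
    0: ([("add", "t0", 1), ("brcond", "flag", 2)], 1),  # two-way: OpSeq says 1, branch says 2
    1: ([("add", "t0", 10)], None),
    2: ([("add", "t0", 100)], None),
}

print(run_emulation_rom(rom, 0, {"flag": 0}))   # [0, 1]: falls through via the OpSeq field
print(run_emulation_rom(rom, 0, {"flag": 1}))   # [0, 2]: conditional branch taken
```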
Abstract:
A multimedia extension unit (MEU) is provided for performing various multimedia-type operations. The MEU can be coupled either through a coprocessor bus or a local central processing unit (CPU) bus to a conventional processor. The MEU employs vector registers, a vector arithmetic logic unit (ALU), and an operand routing unit (ORU) to perform a maximum number of the multimedia operations within as few instruction cycles as possible. Complex algorithms are readily performed by arranging operands upon the vector ALU in accordance with the desired algorithm flowgraph. The ORU aligns the operands within partitioned slots or sub-slots of the vector registers using vector instructions unique to the MEU. At the output of the ORU, operand pairs from vector source or destination registers can be easily routed and combined at the vector ALU. The vector instructions employ special load/store instructions in combination with numerous operational instructions to carry out concurrent multimedia operations on the aligned operands.
Abstract:
The present invention provides for the updating of both the instructions in a branch prediction cache and instructions recently provided to an instruction pipeline from the cache when an instruction being executed attempts to change such instructions ("Store-Into-Instruction-Stream"). The branch prediction cache (BPC) includes a tag identifying the address of instructions causing a branch, a record of the target address which was branched to on the last occurrence of each branch instruction, and a copy of the first several instructions beginning at this target address. A separate instruction cache is provided for normal execution of instructions, and all of the instructions written into the branch prediction cache from the system bus must also be stored in the instruction cache. The instruction cache monitors the system bus for attempts to write to the address of an instruction contained in the instruction cache. Upon such a detection, that entry in the instruction cache is invalidated, and the corresponding entry in the branch prediction cache is invalidated. A subsequent attempt to use an entry in the branch prediction cache that has been invalidated will detect that it is not valid and will instead fetch the modified instruction from main memory.
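The invalidation protocol can be sketched behaviorally: the instruction cache snoops bus writes, and a hit invalidates both its own line and any branch prediction cache entry whose cached target instructions cover the written address. Class and field names are illustrative assumptions, not the patent's structures.

```python
class BranchPredictionCache:
    def __init__(self):
        self.entries = {}    # branch address -> {"target", "target_insns", "valid"}

    def fill(self, branch_addr, target, target_insns):
        self.entries[branch_addr] = {"target": target,
                                     "target_insns": list(target_insns),
                                     "valid": True}

    def invalidate_target(self, written_addr):
        # Invalidate any entry whose copied target instructions include the
        # written address.
        for e in self.entries.values():
            if written_addr in range(e["target"], e["target"] + len(e["target_insns"])):
                e["valid"] = False

    def lookup(self, branch_addr):
        e = self.entries.get(branch_addr)
        return e if e and e["valid"] else None

class InstructionCache:
    def __init__(self, bpc):
        self.lines = {}      # instruction address -> instruction bytes
        self.bpc = bpc

    def fill(self, addr, data):
        self.lines[addr] = data

    def snoop_bus_write(self, addr, data):
        if addr in self.lines:               # store into the instruction stream
            del self.lines[addr]             # invalidate our copy
            self.bpc.invalidate_target(addr) # and the cached target instructions

bpc = BranchPredictionCache()
icache = InstructionCache(bpc)
icache.fill(0x2000, "old insn")
bpc.fill(branch_addr=0x1000, target=0x2000, target_insns=["old insn"])
icache.snoop_bus_write(0x2000, "new insn")
print(bpc.lookup(0x1000))    # None: the stale target copy must be refetched from memory
```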
Abstract:
An improved branch prediction cache (BPC) scheme that utilizes a hybrid cache structure. The BPC provides two levels of branch information caching. The fully associative first level BPC is a shallow but wide structure (36 32-byte entries), which caches full prediction information for a limited number of branch instructions. The second direct mapped level BPC is a deep but narrow structure (256 2-byte entries), which caches only partial prediction information, but does so for a much larger number of branch instructions. As each branch instruction is fetched and decoded, its address is used to perform parallel look-ups in the two branch prediction caches.
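A rough model of the hybrid structure, probing both levels with the same branch address. The entry counts follow the abstract; the replacement policy, indexing, and record contents are simplifying assumptions.

```python
class HybridBPC:
    L1_ENTRIES = 36      # fully associative, full prediction information
    L2_ENTRIES = 256     # direct mapped, partial prediction information

    def __init__(self):
        self.l1 = {}                        # branch addr -> full prediction record
        self.l2 = [None] * self.L2_ENTRIES  # index -> (tag, partial record)

    def update(self, branch_addr, full_info, partial_info):
        if len(self.l1) >= self.L1_ENTRIES and branch_addr not in self.l1:
            self.l1.pop(next(iter(self.l1)))        # naive replacement for the sketch
        self.l1[branch_addr] = full_info
        idx = branch_addr % self.L2_ENTRIES
        self.l2[idx] = (branch_addr, partial_info)

    def lookup(self, branch_addr):
        # Both levels are probed with the same address "in parallel".
        full = self.l1.get(branch_addr)
        tag, partial = self.l2[branch_addr % self.L2_ENTRIES] or (None, None)
        partial = partial if tag == branch_addr else None
        return full, partial

bpc = HybridBPC()
bpc.update(0x4010, full_info={"target": 0x5000, "insns": "..."},
           partial_info={"taken": True})
print(bpc.lookup(0x4010))    # full hit in level one, partial hit in level two
print(bpc.lookup(0x4110))    # miss in both (same level-two index, different tag)
```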
Abstract:
A semiconductor memory array with Built-In Self-Repair (BISR) includes redundancy circuits, associated with failed row address stores, that drive redundant row word lines, thereby obviating the supply and normal decoding of a substitute address. NOT comparator logic compares a failed row address generated and stored by BISR circuits to a row address supplied to the memory array. A TRUE comparator configured in parallel with the NOT comparator simultaneously compares the defective row address signal to the supplied row address. Since the NOT comparison is performed quickly in dynamic logic without setup and hold time constraints, its timing impact on the normal (non-redundant) row decode path is negligible, and since the TRUE comparison, though potentially slower than the NOT comparison, itself identifies a redundant row address and therefore need not employ an N-bit address-to-word-line decode, redundant row addressing is rapid and does not degrade performance of the self-repaired semiconductor memory array. By providing redundancy handling at the predecode circuit level, rather than at a preliminary address substitution stage, access times to a BISR memory array in accordance with the present invention are improved.
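The parallel NOT/TRUE comparison can be sketched behaviorally: a mismatch (the fast NOT comparison) lets the normal decode path proceed, while a confirmed match (the TRUE comparison) drives the redundant row word line directly, with no address substitution or normal decode. The bit width and function names are illustrative assumptions.

```python
ADDR_BITS = 8

def bisr_row_select(supplied_addr, failed_addr, redundancy_enabled):
    """Return ("redundant", None) when the redundant row word line is driven,
    or ("normal", row_addr) when the normal decode path is used."""
    # TRUE comparison: the supplied address matches the stored failed address.
    true_match = redundancy_enabled and (supplied_addr == failed_addr)
    # NOT comparison: any differing bit proves the addresses cannot match,
    # which is the fast check that keeps the normal decode path enabled.
    mismatch = (supplied_addr ^ failed_addr) & ((1 << ADDR_BITS) - 1)
    if true_match and mismatch == 0:
        return ("redundant", None)      # no substitute address decode needed
    return ("normal", supplied_addr)    # normal predecode/decode continues

print(bisr_row_select(0x2A, failed_addr=0x2A, redundancy_enabled=True))  # ('redundant', None)
print(bisr_row_select(0x2B, failed_addr=0x2A, redundancy_enabled=True))  # ('normal', 43)
```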