摘要:
A microprocessor with a floating point unit configured to rapidly execute floating point load control word (FLDCW) type instructions in an out of program order context is disclosed. The floating point unit is configured to schedule instructions older than the FLDCW-type instruction before the FLDCW-type instruction is scheduled. The FLDCW-type instruction acts as a barrier to prevent instructions occurring after the FLDCW-type instruction in program order from executing before the FLDCW-type instruction. Indicator bits may be used to simplify instruction scheduling, and copies of the floating point control word may be stored for instruction that have long execution cycles. A method and computer configured to rapidly execute FLDCW-type instructions in an out of program order context are also disclosed.
摘要:
An apparatus and method for superforwarding load operands in a microprocessor are provided. An execution unit in a microprocessor is configured to receive a load instruction and a subsequent instruction. If the load instruction corresponds to a simple load instruction, a destination operand of the load instruction can be superforwarded to a subsequent instruction if the subsequent instruction specifies a source operand that depends on the destination operand of the load instruction. The subsequent instruction is not required to wait until a load instruction executes or completes and can be scheduled and/or executed prior to or at the same time as the load instruction. Consequently, latencies associated with operand dependencies may be reduced.
摘要:
A microprocessor configured to dynamically switch its floating point load pipeline length from one stage in length to more than one stage in length is disclosed. The microprocessor may perform normal loads and detect denormal loads in a single clock cycle. The microprocessor temporarily stores each scheduled floating point instruction in a reissue buffer for at least one clock cycle. When a denormal load instruction is detected, the microprocessor is configured to add one or more stages to the floating point load pipeline to allow the denormal value to complete the conversion to an internal format. The longer pipeline is then used for all loads that follow the denormal load until there is an idle clock cycle or an abort occurs. At that point, the pipeline reverts back to its original shorter state. In addition, the microprocessor may be configured to cancel instructions scheduled assuming the denormal load would take only one clock cycle to complete. The canceled instruction is then “replayed” during a later clock cycle from the reissue buffer. A method for performing denormal loads and a computer system are also disclosed.
摘要:
A microprocessor with a floating point unit configured to efficiently allocate multi-pipeline executable instructions is disclosed. Multi-pipeline executable instructions are instructions that are not forced to execute in a particular type of execution pipe. For example, junk ops are multi-pipeline executable. A junk op is an instruction that is executed at an early stage of the floating point unit's pipeline (e.g., during register rename), but still passes through an execution pipeline for exception checking. Junk ops are not limited to a particular execution pipeline, but instead may pass through any of the microprocessor's execution pipelines in the floating point unit. Multi-pipeline executable instructions are allocated on a per-clock cycle basis using a number of different criteria. For example, the allocation may vary depending upon the number of multi-pipeline executable instructions received by the floating point unit in a single clock cycle.
摘要:
A microprocessor configured to rapidly execute floating point store status word (FSTSW) type instructions that are immediately preceded by floating point compare (FCOM) type instructions is disclosed. FCOM-type instructions are modified to store their results to an architectural floating point status word and a temporary destination register. If an FSTSW-type instruction is detected immediately following an FCOM-type instruction, then the FSTSW-type instruction is transformed into a special fast floating point store status word (FSTSWEF) instruction. Unlike the FSTSW-type instruction, which is serializing and negatively impacts performance, the FSTSWEF instruction is not serializing and allows execution to continue without undue serialization. A computer system and method for rapidly executing FSTSW instructions immediately preceded by FCOM-type instructions are also disclosed.
摘要:
A microprocessor includes one or more registers which are architecturally defined to be used for at least two data formats. In one embodiment, the registers are the floating point registers defined in the x86 architecture, and the data formats are the floating point data format and the multimedia data format. The registers actually implemented by the microprocessor for the floating point registers use an internal format for floating point data. Part of the internal format is a classification field which classifies the floating point data in the extended precision defined by the x86 microprocessor architecture. Additionally, a classification field encoding is reserved for multimedia data. As the microprocessor begins execution of each multimedia instruction, the classification information of the source operands is examined to determine if the data is either in the multimedia class, or in a floating point class in which the significand portion of the register is the same as the corresponding significand in extended precision. If so, the multimedia instruction executes normally. If not, the multimedia instruction is faulted. Similarly, as the microprocessor begins execution of each floating point instruction, the classification information of the source operands is examined. If the data is classified as multimedia, the floating point instruction is faulted. A microcode routine is used to reformat the data stored in at least the source registers of the faulting instruction into a format useable by the faulting instruction. Subsequently, the faulting instruction is re-executed.
摘要:
A microprocessor with a floating point unit configured to rapidly execute floating point compare (FCOMI) type instructions that are followed by floating point conditional move (FCMOV) type instructions is disclosed. FCOMI-type instructions, which normally store their results to integer status flag registers, are modified to store a copy of their results to a temporary register located within the floating point unit. If an FCMOV-type instruction is detected following an FCOMI-type instruction, then the FCMOV-type instruction's source for flag information is changed from the integer flag register to the temporary register. FCMOV-type instructions are thereby able to execute earlier because they need not wait for the integer flags to be read from the integer portion of the microprocessor. A computer system and method for rapidly executing FCOMI-type instructions followed by FCMOV-type instructions are also disclosed.
摘要:
A processor includes a store queue configured to detect a hit on a store queue entry for a load being executed by the processor, and to forward data from the store queue entry to provide a result for the load. The store queue data is provided to the data cache, along with an indication of how much data is being provided (e.g. byte enables). The data cache may then fill in any additional data accessed by the load from cache data, and provide a load result. Additionally, the store queue is configured to detect if more than one store queue entry is hit (i.e. that more than one store within the store queue updates at least one byte accessed by the load), referred to as a multimatch. If a multimatch is detected, the store queue retries the load. Subsequently, the load may be reexecuted and may not multimatch (as entries are deleted upon completion of the corresponding stores). The load may complete when it does not multimatch. In one embodiment, the store queue independently detects hits on the upper and lower portions of each store queue entry (e.g. doubleword portions) and forwards from the upper and lower portions independently. Thus, a load may hit one store queue entry for the lower portion of the data accessed by the load and a different store queue entry for the upper portion of the data accessed by the load without multimatch detection.
摘要:
A processor includes a store queue and a store queue number assignment circuit. The store queue number assignment circuit assigns store queue numbers to stores, and operates upon instruction operations prior to the instruction operations reaching a point in the pipeline of the processor at which out of order instruction processing begins. Thus, store queue entries may be reserved for stores according to the program order of the stores. Additionally, in one embodiment, the store queue number identifying the youngest store represented in the store queue may be assigned to loads. In this manner, loads may determine which stores in the store queue are older or younger than the load based on relative position within the store queue. Checking for store queue hits may be qualified with the entries between the head of the store queue and the entry indicated by the load's store queue number. In one particular embodiment, the store queue number may include an additional “toggle” bit which is toggled each time the assignment of store queue numbers reaches the maximum store queue entry and wraps to zero. If the toggle bit of the store in the store queue entry identified by the load's store queue number differs from the toggle bit of the load's store queue number, than the store queue entry has been reassigned to a store younger than the load.
摘要:
A dynamic memory allocation routine maintains an allocation size cache which records the address of a most recently allocated memory block for each different size of memory block that has been allocated. Upon receiving a dynamic memory allocation request, the dynamic memory allocation routine determines if the requested size is equal to one of the sizes recorded in the allocation size cache. If a matching size is found, the dynamic memory allocation routine attempts to allocate a memory block contiguous to the most recently allocated memory block of that matching size. If the contiguous memory block has been allocated to another memory block, the dynamic memory allocation routine attempts to reserve a reserved memory block having a size which is a predetermined multiple of the requested size. The requested memory block is then allocated at the beginning of the reserved memory block. By reserving the reserved memory block, the dynamic memory allocation routine may increase the likelihood that subsequent requests for memory blocks having the requested size can be allocated in contiguous memory locations.